Data science/ML/AI


Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist


Network: Programming, data science, ML - free courses by Big Data Specialist · India: 31 635 · Technologies and applications: 9 377...

📈 Analytical overview of the Telegram channel Data science/ML/AI

Data science/ML/AI (@datascience_bds) is an active English-language channel. The community currently has 13 674 subscribers, ranking 9 377th in the Technologies and applications category and 31 635th in the India region.

📊 Audience metrics and dynamics

Since its creation (date unknown), the project has grown to an audience of 13 674 subscribers.

According to the latest data, from June 9, 2026, the channel shows stable activity. The number of members changed by 155 over the last 30 days and by 5 over the last 24 hours, while overall reach remains high.

Verification status: Not verified
Engagement rate (ER): The average audience engagement rate is 8.03%. In the first 24 hours after publication, content typically collects reactions from 2.25% of all subscribers.
Post reach: On average, each post receives 1 098 views. Within the first day, a post collects 308 views.
Reactions and interactions: The average number of reactions per post is 5.
Topical interests: Content focuses on key topics such as panda, learning, row, api, ethic.

📝 Description and content policy

The author describes the resource as follows:
“Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatasci...”

Thanks to frequent updates (latest data received June 10, 2026), the channel stays current and keeps a high level of post reach. Analytics show that the audience actively interacts with the content, which makes the channel an important point of influence in the Technologies and applications category.

Subscribers: 13 674
+5 in 24 hours
+19 in 7 days
+155 in 30 days

Post views: 1 098
~308 in 24 hours
~452 in 48 hours

Engagement rate: 8.03%

Posts per day: ~1


Post archive


▎Common Machine Learning Terms
1. Algorithm: A set of rules or steps used to solve a problem or perform a task, particularly in the context of data processing and analysis.
2. Model: A mathematical representation of a real-world process, created by training an algorithm on data.
3. Training Data: The dataset used to train a machine learning model, consisting of input-output pairs.
4. Test Data: A separate dataset used to evaluate the performance of a trained model, ensuring it generalizes well to unseen data.
5. Overfitting: A modeling error that occurs when a model learns the training data too well, capturing noise along with the underlying pattern, leading to poor performance on new data.
6. Underfitting: A situation where a model is too simple to capture the underlying trend in the data, resulting in poor performance on both training and test datasets.
7. Feature: An individual measurable property or characteristic of the data used as input for a model.
8. Label: The output or target variable that a model aims to predict based on the input features.
9. Supervised Learning: A type of machine learning where the model is trained on labeled data, learning to map inputs to outputs.
10. Unsupervised Learning: A type of machine learning where the model is trained on unlabeled data, aiming to find patterns or groupings within the data.
11. Reinforcement Learning: A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward.
12. Hyperparameters: Configuration settings used to control the training process of a model, which are set before training begins.
13. Loss Function: A mathematical function that quantifies how well a model's predictions match the actual outcomes; used to guide the optimization process.
14. Gradient Descent: An optimization algorithm used to minimize the loss function by iteratively adjusting model parameters in the direction of the steepest descent.
15. Cross-Validation: A technique for assessing how the results of a model will generalize by dividing the dataset into multiple subsets and training/testing across them.
16. Confusion Matrix: A table used to evaluate the performance of a classification model by comparing predicted labels against actual labels.
17. Precision and Recall: Metrics used to evaluate classification models; precision measures the accuracy of positive predictions, while recall measures the ability to find all relevant instances.
18. ROC Curve (Receiver Operating Characteristic Curve): A graphical representation of a model's diagnostic ability across various threshold settings, plotting true positive rates against false positive rates.
19. Regularization: Techniques used to prevent overfitting by adding a penalty for complexity to the loss function (e.g., L1 and L2 regularization).
20. Ensemble Learning: Combining multiple models to improve overall performance; common methods include bagging, boosting, and stacking.
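
To make terms 16 and 17 concrete, here is a minimal sketch that counts the four confusion-matrix cells for a binary classifier and derives precision and recall from them. The function names and the label lists are made up for illustration; in practice a library such as scikit-learn would normally be used instead.

```python
from collections import Counter


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary classification task."""
    counts = Counter()
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            counts["tp" if actual == positive else "fp"] += 1
        else:
            counts["fn" if actual == positive else "tn"] += 1
    return counts["tp"], counts["fp"], counts["fn"], counts["tn"]


def precision_recall(y_true, y_pred, positive=1):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp, fp, fn, _ = confusion_counts(y_true, y_pred, positive)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels, invented for this example.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

print(confusion_counts(y_true, y_pred))  # (tp, fp, fn, tn)
print(precision_recall(y_true, y_pred))  # (precision, recall)
```

Precision answers "of everything predicted positive, how much was right?", while recall answers "of everything actually positive, how much was found?".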


Repost from Programming Quiz Channel

Which ML technique is used to improve model performance by combining multiple models?

Anonymous voting


Repost from Data science research papers

TradingAgents: Multi-Agents LLM Financial Trading Framework
📅 Publication Date: Dec 28, 2024
📑 Paper: https://arxiv.org/pdf/2412.20138
🔗 Code: https://github.com/tauricresearch/tradingagents
🚀 Spaces citing this paper:
• https://huggingface.co/spaces/shanghengdu/LLM-Agent-Optimization-PaperList
• https://huggingface.co/spaces/tahp0604/ai-stock-watchlist
📝 Description: The paper introduces TradingAgents, a multi-agent framework that utilizes large language models for stock trading, simulating the collaborative dynamics of real-world trading firms. The framework consists of various agents, including fundamental analysts, sentiment analysts, technical analysts, and traders with different risk profiles, all powered by large language models. These agents work together to assess market conditions, manage risk, and make informed trading decisions. The framework also includes researcher agents that evaluate market conditions and a risk management team that monitors exposure.


What is Agentic AI?


🧠 The Statistical Illusion: Simpson’s Paradox 🎭
Imagine you are choosing a hospital for a surgery.
• Hospital A has a higher survival rate than Hospital B for "Easy" cases.
• Hospital A also has a higher survival rate for "Hard" cases.
Common sense says: Choose Hospital A. But when you look at the total combined data, Hospital B actually has a higher survival rate. 🤯

This is Simpson’s Paradox: A trend appears in several different groups but disappears or reverses when these groups are combined.

🔍 Why does this happen?
It happens because of a Lurking Variable (a hidden factor). In this case, Hospital A is a world-class facility, so it takes on way more "Hard" cases than Hospital B. Even though they are better at both types, the high volume of risky surgeries drags their overall average down.

🐍 See the Paradox in Code
Let's simulate this "impossible" scenario using Python:

import pandas as pd

# Data: survivors and total cases for each hospital and case type.
# Hospital A handles mostly "Hard" cases; Hospital B handles mostly "Easy" ones.
data = {
    'Hospital': ['A', 'A', 'B', 'B'],
    'Case_Type': ['Easy', 'Hard', 'Easy', 'Hard'],
    'Survived': [95, 100, 900, 7],
    'Total': [100, 1000, 1000, 100]
}
df = pd.DataFrame(data)

# 1. Check rates per group
df['Rate'] = df['Survived'] / df['Total']
print("--- Rates by Group ---")
print(df[['Hospital', 'Case_Type', 'Rate']])

# 2. Check overall rates: sum the counts first, then divide
overall = df.groupby('Hospital')[['Survived', 'Total']].sum()
overall['Overall_Rate'] = overall['Survived'] / overall['Total']
print("\n--- Overall Rates (The Paradox!) ---")
print(overall['Overall_Rate'])

The Result:
• A is better at Easy (95% vs 90%).
• A is better at Hard (10% vs 7%).
• BUT... Overall, B wins (about 82% vs 18%) because B mostly did "Easy" cases.

🛠 How to avoid being fooled?
1. Don't trust the aggregate: When analyzing data, always try to "segment" or "drill down" into sub-groups.
2. Look for the Weight: Ask yourself: "Is one group disproportionately represented in the total?"
3. Identify the Lurking Variable: What context is missing? (e.g., Age, Severity, Time of Day).

🎯 The Takeaway
In Data Science, the "Big Picture" can sometimes be a big lie. If your analysis produces a result that defies logic, you might be looking at a Simpson’s Paradox. Always slice your data before you trust it.
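
The "Look for the Weight" tip can be made explicit: each hospital's overall rate is a weighted average of its per-group rates, where the weights are the shares of each case type. The sketch below uses plain Python and illustrative counts (an assumption, chosen so that Hospital A handles mostly "Hard" cases) to print the rates next to their weights.

```python
# Illustrative counts (an assumption): (survived, total) per hospital and case type.
cases = {
    "A": {"Easy": (95, 100), "Hard": (100, 1000)},
    "B": {"Easy": (900, 1000), "Hard": (7, 100)},
}


def overall_rate(groups):
    """Overall rate written as a weighted average of per-group rates."""
    total = sum(n for _, n in groups.values())
    return sum((s / n) * (n / total) for s, n in groups.values())


for hospital, groups in cases.items():
    total = sum(n for _, n in groups.values())
    for case_type, (survived, n) in groups.items():
        print(f"{hospital} {case_type}: rate={survived / n:.2f}, weight={n / total:.2f}")
    print(f"{hospital} overall: {overall_rate(groups):.3f}")
```

Hospital A has the better rate in every group, but most of its weight sits on the low-rate "Hard" group, which is what pulls its overall figure below Hospital B's.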


Hey everyone 👋 I know I promised to create a Data Science course. I was working on it late last year, but since early 2026 I’ve had some health issues, so it got postponed. I’ll get back to it as soon as I’m better 🙌 In the meantime, I launched this ☝️ today: https://learndevs.com/ I started building this back in the 2020s, together with many of you. It’s not perfect yet, but better to have it now than wait forever. Would love your feedback ❤️


Repost from Programming, data science, ML - free courses by Big Data Specialist

We’re live 🚀 After 4 years of work, I finally launched: 👉 learndevs.com

Goal: one place for everything a developer needs (free courses, tech news, job offers, manually written blogs, best GitHub repos, etc.)

A lot of you contributed by writing code or adding courses and knowledge along the way. This is as much yours as it is mine 🙌

And I’m already working on:
• Personalized roadmaps
• Live chat
• Better job search & placement

Try it and please tell me: What would you add next?

Reminder: if you want early access to new features, join our beta testers group. Looking for people who will explore, break things, and share honest feedback.


Repost from Programming Quiz Channel

Which trade-off is common in database indexing?

Anonymous voting


Heart of Data Science


🗺 The 5 W's of Data Visualization: Why, Who, What, When, Where
Creating a chart is easy. Creating a good chart, one that actually communicates an insight and isn't just a pretty picture, requires thinking like a detective. You need to answer the "5 W's" before you even pick a chart type. Every great visualization tells a story, and you need to know the plot points.

🤔 1. WHY: What is the Goal?
Before you draw anything, ask:
• What question am I trying to answer? (e.g., "How do sales change over time?", "Which region performs best?")
• What insight do I want the viewer to gain? (e.g., "Sales are growing rapidly," "Region X is underperforming.")
• What decision will this chart help make? (e.g., "Should we invest more in Region Y?")
Your chart's purpose dictates everything from chart type to color choices.

👥 2. WHO: Who is the Audience?
Consider who will be looking at your chart:
• Technical Experts: Can handle complex plots, statistical jargon, and detailed axes.
• Business Stakeholders: Need clear, high-level insights. Focus on the "so what?" Avoid jargon.
• General Public: Keep it simple, use intuitive charts, and provide clear titles and labels.
A chart for an AI researcher is vastly different from one for a marketing team.

📊 3. WHAT: What Data is Relevant?
• What variables (columns) are needed? Don't include everything just because it's there.
• What time frame or subset of data is required? (e.g., Q3 sales only, data for specific countries).
• What are the units? ($, %, kg, units, etc.) Crucial for labels!

⏰ 4. WHEN: When is the Data Important?
This is about the time or sequence of your data:
• Trends over time? (Line charts, area charts)
• Comparisons at a specific point? (Bar charts, pie charts - use sparingly!)
• Distribution within a period? (Histograms, box plots)
• Relationships at any time? (Scatter plots)
The "when" helps you choose the chart type that best shows change or static comparison.

🗺 5. WHERE: Where Does the Data Live?
• Geographical Data: If your data is tied to locations (countries, states, cities), use maps!
  • Choropleth Maps: Color-coding regions based on a value.
  • Point Maps: Showing locations with markers.
• Hierarchical Data: If your data has levels (e.g., Company > Department > Team), use treemaps or sunburst charts.

💡 The Golden Rule of Visualization: The chart should make the insight obvious, not require the viewer to dig for it. If you're not sure, ask someone from your target audience to look at it and tell you what they see.

🎯 What you should do
✔️ Clarify your chart's purpose (WHY).
✔️ Tailor your visuals to your audience (WHO).
✔️ Select only the necessary data (WHAT).
✔️ Choose chart types that reflect time/sequence (WHEN).
✔️ Use maps or hierarchical charts for spatial/structural data (WHERE).


▎Common MLOps Terms
1. MLOps: A set of practices that automates and standardizes the lifecycle of Machine Learning models, from experimentation and development to deployment and maintenance.
2. Model Training: The process of feeding data to an ML algorithm to learn patterns and make predictions, resulting in a trained model.
3. Feature Store: A centralized repository for storing, serving, and managing features for Machine Learning models, ensuring consistency between training and inference.
4. Data Versioning: The practice of tracking changes to datasets over time, ensuring reproducibility and allowing rollbacks to previous versions.
5. Model Versioning: Managing different iterations of a Machine Learning model, tracking changes, performance, and metadata.
6. Experiment Tracking: Recording all details of an ML experiment (code, hyperparameters, data, metrics) to compare results and ensure reproducibility.
7. Model Registry: A centralized hub to manage the lifecycle of ML models, including versioning, metadata, and status (e.g., "staging," "production").
8. Model Deployment: The process of making a trained ML model available for predictions in a production environment, often via an API endpoint.
9. Inference: The process of using a deployed ML model to make predictions on new, unseen data.
10. Model Monitoring: Continuously tracking the performance, health, and behavior of deployed ML models to detect issues like data drift or performance degradation.
11. Continuous Training (CT): The practice of automatically retraining and updating ML models in production based on new data or performance metrics.
12. Reproducibility: The ability to achieve the same results (model, predictions) from an ML experiment given the same data, code, and environment.
13. Data Drift: A change in the distribution of input data to an ML model, which can cause performance degradation.
14. Concept Drift: A change in the underlying relationship between the input data and the target variable, leading to model inaccuracy over time.
15. Bias Detection: Identifying and mitigating unfair or discriminatory patterns in ML models or their data, ensuring ethical AI outcomes.
16. ML Pipeline: An automated workflow for running an ML task, encompassing data ingestion, feature engineering, model training, evaluation, and deployment steps.
17. Orchestration: Managing and coordinating the automated tasks within an ML pipeline to ensure they run in the correct sequence and handle dependencies.
18. Explainable AI (XAI): Tools and techniques that make the decisions and predictions of ML models understandable to humans.
19. Serving Infrastructure: The systems and platforms used to host and serve ML models in production, optimized for low-latency inference (e.g., REST APIs, specialized model servers).
20. ML Metadata Management: Storing and organizing information about ML artifacts (datasets, models, features, experiments) to provide lineage and ensure governance.
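
As a rough illustration of how data drift (term 13) might be flagged during model monitoring, the sketch below computes a Population Stability Index (PSI) between a training sample and a live sample of one numeric feature. PSI is only one of several possible drift measures and is swapped in here as an assumption; the function name, the bin count, and the synthetic data are all hypothetical, and both samples are assumed to be non-empty.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population Stability Index between two samples of one numeric feature.

    Bins are equal-width and derived from the expected (training) sample.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def proportions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            idx = min(max(idx, 0), bins - 1)  # clamp values outside the training range
            counts[idx] += 1
        # eps avoids log(0) and division by zero for empty bins
        return [max(c / len(values), eps) for c in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Synthetic feature values, invented for this example.
train = [i / 100 for i in range(1000)]   # roughly uniform on [0, 10)
live_same = list(train)                  # identical distribution
live_shifted = [v + 3 for v in train]    # same shape, shifted by 3

print(f"PSI, no drift:  {psi(train, live_same):.3f}")
print(f"PSI, shifted:   {psi(train, live_shifted):.3f}")
```

A commonly quoted rule of thumb treats PSI above roughly 0.25 as a sign of substantial drift, but any threshold should be validated against the model and data at hand.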


Repost from Programming Quiz Channel

Which metric is best for regression problems?

Anonymous voting


Software Engineer to AI Engineer: 2026 Practical Roadmap


Repost from Programming Quiz Channel

Which concept helps reduce variance in machine learning models?

Anonymous voting


📉 The Art of the Dashboard: Choosing the Right Chart Type 🖼
You have clean data, you've tested your hypotheses, and now you need to show your findings. But which chart do you use? A bar chart? A line chart? A pie chart (gulp)? Choosing the wrong chart can obscure your message or even mislead your audience. Choosing the right one makes your data sing.

1. To Show a Trend Over Time 📈
Best For: Seeing how something changes day-to-day, month-to-month, year-to-year.
Chart Types:
- Line Chart: Classic, great for continuous data. Shows direction.
- Area Chart: Like a line chart, but the area under the line is filled. Good for showing total volume over time.
- Bar Chart (Time Series): Use if you have discrete time periods (e.g., yearly sales) and want to compare exact values.

# Example Use Case: Monthly Website Traffic
# Chart: Line Chart

2. To Compare Categories 📊
Best For: Showing differences in size or value across distinct groups.
Chart Types:
- Bar Chart (Vertical/Column): Most common. Great for comparing quantities across groups. Easy to read exact values.
- Bar Chart (Horizontal): Better when you have many categories or long category names.
- Grouped Bar Chart: Compares sub-categories within main categories.
- Stacked Bar Chart: Shows total for a category AND how it's made up of sub-categories.

# Example Use Case: Sales per Region
# Chart: Horizontal Bar Chart

3. To Show Composition (Part-to-Whole) 🍕
Best For: Displaying how a total is divided into parts. Use with caution!
Chart Types:
- Pie Chart: Only use if you have few categories (max 5-6) and you want to show proportions of a whole. The *largest* slice is easiest to read.
- Donut Chart: Similar to pie, but the center is cut out (can sometimes display a total value).
- Stacked Bar Chart (100%): Shows proportions across categories, but as bars, which are often easier to compare than pie slices.

# Example Use Case: Market Share (if only 3 companies)
# Chart: Pie Chart (if few companies) or 100% Stacked Bar

Warning: Humans are bad at comparing slice angles. Bar charts are usually better for precise comparisons.

4. To Show Relationships (Correlation) 🔗
Best For: Seeing if two numerical variables are connected and how strongly.
Chart Types:
- Scatter Plot: The go-to. Each dot is an observation, showing the values of two variables. Look for patterns (linear, curved, clusters).
- Bubble Chart: A scatter plot where the size of the "bubble" (dot) represents a third numerical variable.

# Example Use Case: Does Experience correlate with Salary?
# Chart: Scatter Plot

5. To Show Distribution 📦
Best For: Understanding the range, spread, and central tendency of a single numerical variable.
Chart Types:
- Histogram: Shows frequency counts within bins (ranges) of your data. Great for spotting skewness or multi-modal distributions.
- Box Plot (Whisker Plot): Shows median, quartiles, and potential outliers. Excellent for comparing distributions across categories.

# Example Use Case: Distribution of customer ages
# Chart: Histogram or Box Plot (if comparing age by product)

💡 The Ultimate Rule: Keep it simple. The chart should tell the story quickly. If your audience has to stare at it for five minutes to figure out what's going on, it's not working.

🎯 Today's Goal (What you should do)
✔️ Know which chart excels at showing trends vs. comparisons vs. relationships.
✔️ Use bar charts for categories and line charts for time.
✔️ Be very cautious with pie charts!
✔️ Use scatter plots to find connections.
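
To see what the distribution charts in section 5 actually summarize, here is a small standard-library sketch that computes the bin counts a histogram would draw and the five-number summary a box plot is built from. The customer ages are invented for illustration, and the 10-year bin width is an arbitrary choice.

```python
import statistics


def histogram_counts(values, bin_width):
    """Count how many values fall into each [start, start + bin_width) bin."""
    counts = {}
    for v in values:
        start = (v // bin_width) * bin_width
        counts[start] = counts.get(start, 0) + 1
    return dict(sorted(counts.items()))


def five_number_summary(values):
    """Minimum, first quartile, median, third quartile, maximum."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return min(values), q1, median, q3, max(values)


# Hypothetical customer ages, made up for this example.
ages = [19, 22, 23, 25, 27, 28, 31, 33, 34, 35, 38, 41, 44, 52, 67]

# Text version of a histogram: one row per bin, one '#' per observation.
for start, count in histogram_counts(ages, 10).items():
    print(f"{start:>3}-{start + 9:<3} {'#' * count}")

print(five_number_summary(ages))
```

The bin width changes the picture noticeably, so it is worth trying a few values before settling on a histogram.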




💎 5 Rare But High-Value Sites for Data Scientists
If you’re tired of the same surface-level tutorials, these five "hidden gems" provide deep technical value you'll refer to for the rest of your career:

1️⃣ Deep Learning Drizzle
A massive, curated database of free, high-quality university courses (Stanford, MIT, CMU) covering every niche in AI and ML.
🔗 https://deep-learning-drizzle.github.io/

2️⃣ Distill pub
It uses incredible interactive visualizations to explain complex machine learning research papers that are usually very hard to digest.
🔗 https://distill.pub/

3️⃣ Connected Papers
It creates a visual map of how academic papers are linked so you can find the "ancestors" of any specific algorithm.
🔗 https://www.connectedpapers.com/

4️⃣ ML-Ops org
While everyone focuses on building models, this site teaches you the "production" side: how to actually deploy, monitor, and manage models in the real world.
🔗 https://ml-ops.org/

5️⃣ Explained ai
Provides the most intuitive, deep-dive explanations on the internet for how specific algorithms (like Random Forests or Gradient Boosting) actually work under the hood.
🔗 https://explained.ai/

Save these for your next deep-work session! 🚀


Power BI Dax Formulas Handbook.pdf (0.12 KB)


⚖️ Hypothesis Testing & P-values 🧑‍⚖️📊
You've run an A/B test. Your new website design (Version B) got 12% more clicks than the old one (Version A). Great, right? But is that 12% a real improvement, or just a lucky fluctuation in your data? This is where Hypothesis Testing and the notorious P-value come in. They help you decide if your observed data is significant enough to make a big decision, or if it's just random chance.

🏛 The Courtroom Scenario
Imagine a trial:
• Default Assumption (Null Hypothesis, H0): The defendant is NOT GUILTY. (Our designs are the same, the 12% is luck.)
• What We're Trying to Prove (Alternative Hypothesis, H1): The defendant IS GUILTY. (Version B is better than A.)
• The Evidence (Your Data): The 12% difference in clicks.
• The Judge's Decision (P-value): How likely is it that we'd see this "evidence" (12% difference) if the defendant were truly not guilty (designs truly the same)?

1. The Null (H0) & Alternative (H1) Hypotheses
• Null Hypothesis (H0): There is no difference between the two groups/variables. (e.g., "New design has no effect on clicks." or "Mean sales for region X is 100.")
• Alternative Hypothesis (H1): There is a difference or relationship. (e.g., "New design increases clicks." or "Mean sales for region X is not 100.")
Our goal is usually to reject H0 in favor of H1.

2. The P-value: What It Actually Means
The P-value is the probability of observing data as extreme as (or more extreme than) your current data, assuming the Null Hypothesis is true.
• Small P-value (e.g., 0.01): "It's highly unlikely we'd see this much difference if the new design had no effect. So, we'll reject the null and conclude the new design is better."
• Large P-value (e.g., 0.60): "There's a good chance we'd see this difference just by luck, even if the new design had no real effect. So, we fail to reject the null."

3. The Significance Level (Alpha, α)
This is your cutoff point. Most commonly, α = 0.05 (5%).
• P-value ≤ α: Reject the Null Hypothesis. (Your result is "statistically significant.")
• P-value > α: Fail to Reject the Null Hypothesis. (Your result is not statistically significant.)

4. The Biggest Misconceptions (DON'T DO THIS!)
• P-value is NOT the probability that H0 is true.
• P-value is NOT the probability that H1 is false.
• A "significant" P-value doesn't mean the effect is large or important in the real world. (A tiny, unimportant difference can be statistically significant if you have a huge dataset.)

🎯 Today's Goal (What you should do)
✔️ Formulate clear Null and Alternative Hypotheses.
✔️ Understand the P-value as the probability of seeing data at least as extreme as yours if the Null were true.
✔️ Use a significance level (alpha) to make decisions.
✔️ AVOID common P-value misinterpretations!

👉 P-values don't tell you if your hypothesis is true, but they do tell you how surprising your data would be if the Null Hypothesis were true.
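
The A/B scenario above can be sketched with a one-sided two-proportion z-test, using only the Python standard library. The test choice, the function name, and the click counts are assumptions made for illustration (the post does not say which test or sample sizes were used); both samples show the same 12% relative lift, but at different sizes.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """One-sided z-test of H0: p_b == p_a against H1: p_b > p_a.

    Relies on the normal approximation, so it assumes reasonably large samples.
    """
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    pooled = (clicks_a + clicks_b) / (n_a + n_b)  # click rate under H0
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 1 - NormalDist().cdf(z)  # P(Z >= z) if H0 is true
    return z, p_value


ALPHA = 0.05

# Hypothetical counts: 5.0% vs 5.6% click rate, i.e. a 12% relative lift.
z_small, p_small = two_proportion_z_test(50, 1_000, 56, 1_000)
z_large, p_large = two_proportion_z_test(500, 10_000, 560, 10_000)

for label, z, p in [("1,000 users per arm", z_small, p_small),
                    ("10,000 users per arm", z_large, p_large)]:
    verdict = "reject H0" if p <= ALPHA else "fail to reject H0"
    print(f"{label}: z={z:.2f}, p={p:.3f} -> {verdict}")
```

The same observed lift leads to different decisions at different sample sizes, which is why the P-value says nothing by itself about how large or important the effect is.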


How to Choose Your ML Research Topic: Step by Step Framework