Data science/ML/AI


Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatascientist


Network: Programming, data science, ML - free courses by Big Data Specialist; India: 31 771; Technology & Applications: 9 387...

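📈 Analytics overview of the Telegram channel Data science/ML/AI

The channel Data science/ML/AI (@datascience_bds) is an active participant in the English-language segment. The community currently has 13 664 subscribers, ranks 9 387th in the Technology & Applications category, and ranks 31 771st in India.

📊 Audience metrics and growth dynamics

Since its creation (date unknown), the project has kept growing and has attracted 13 664 subscribers.

According to the latest data, from 05 June 2026, the channel remains stable. The subscriber count changed by 171 over the past 30 days and by 1 over the past 24 hours, and overall reach remains substantial.

Verification status: Not verified
Engagement rate (ER): The average audience engagement rate is 7.95%. Within 24 hours of publication, a post typically draws a response from 2.46% of all subscribers.
Post reach: Each post receives 1 086 views on average and typically accumulates 336 views on the first day.
Interaction and feedback: The audience participates actively, with an average of 5 reactions per post.
Topic focus: Content centers on core topics such as panda, learning, row, api, ethic.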

📝 Description and content strategy

The author positions the channel as a platform for expressing subjective opinions:
“Data science and machine learning hub Python, SQL, stats, ML, deep learning, projects, PDFs, roadmaps and AI resources. For beginners, data scientists and ML engineers 👉 https://rebrand.ly/bigdatachannels DMCA: @disclosure_bds Contact: @mldatasci...”

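With frequent updates (latest data collected on 06 June 2026), the channel stays fresh and maintains high reach. The analysis shows that the audience engages actively, making it a key point of influence in the Technology & Applications category.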

Subscribers: 13 664 (+1 in 24 hours, +59 in 7 days, +171 in 30 days)

Post views: 1 086 (~336 in 24 hours, ~499 in 48 hours)

Engagement rate: 7.95%

Posts per day: ~1

Post archive

▎Common Deep Learning Terms

1. Neural Network: A computational model inspired by the human brain, consisting of interconnected nodes (neurons) organized in layers.
2. Layer: A collection of neurons that process input data in a neural network; common types include input layers, hidden layers, and output layers.
3. Activation Function: A mathematical function applied to the output of each neuron, introducing non-linearity into the model; common examples include ReLU, sigmoid, and tanh.
4. Forward Propagation: The process of passing input data through the network to obtain an output prediction.
5. Backpropagation: An algorithm that computes the gradient of the loss function with respect to each weight, so that the optimizer can update the weights of the network.
6. Epoch: One complete pass through the entire training dataset during the training process.
7. Batch Size: The number of training examples used in one iteration of model training; affects memory usage and training speed.
8. Learning Rate: A hyperparameter that controls how much to change the model's weights during training based on the gradient of the loss function.
9. Dropout: A regularization technique that randomly sets a fraction of neuron outputs to zero during training to prevent overfitting.
10. Convolutional Neural Network (CNN): A specialized type of neural network designed for processing grid-like data, such as images, using convolutional layers.
11. Recurrent Neural Network (RNN): A type of neural network designed for sequential data, allowing information to persist across time steps; often used in natural language processing.
12. Long Short-Term Memory (LSTM): A specific type of RNN architecture that can learn long-term dependencies by using memory cells and gates.
13. Generative Adversarial Network (GAN): A framework consisting of two neural networks (generator and discriminator) that compete against each other to generate new data samples.
14. Transfer Learning: A technique where a pre-trained model is fine-tuned on a new, often smaller dataset to leverage learned features.
15. Loss Function: A measure of how well the model's predictions match the actual outcomes; commonly used functions include mean squared error and categorical cross-entropy.
16. Optimizer: An algorithm used to adjust the weights of a neural network during training to minimize the loss function; examples include Adam, SGD, and RMSprop.
17. Gradient Descent: An optimization algorithm used to minimize the loss function by iteratively updating model parameters in the direction of the steepest descent.
18. Overfitting: A modeling error that occurs when a neural network learns noise and details from the training data too well, resulting in poor performance on unseen data.
19. Underfitting: A situation where a neural network fails to capture the underlying trend in the training data, leading to poor performance on both training and test datasets.
20. Data Augmentation: Techniques used to artificially increase the size of a training dataset by creating modified versions of existing data points (e.g., rotating, flipping images).
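
Several of the terms above (forward propagation, loss function, learning rate, epoch, gradient descent) can be seen together in a tiny example. The following is a minimal sketch in plain Python, assuming a one-weight model `y = w * x` and toy data invented for illustration; it is not how a real framework implements training.

```python
# Toy data generated by y = 2 * x, so the ideal weight is 2.0 (assumed for illustration)
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

w = 0.0               # initial weight
learning_rate = 0.05  # step size (hyperparameter)
epochs = 200          # number of full passes over the data

for epoch in range(epochs):
    # Forward propagation: predictions for the current weight
    preds = [w * x for x in xs]
    # Gradient of the mean squared error with respect to w
    grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
    # Gradient descent step: move against the gradient
    w -= learning_rate * grad

# Loss function: mean squared error after training
loss = sum((w * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
print(f"w={w:.4f}, loss={loss:.6f}")
```

A larger learning rate converges in fewer epochs up to a point, after which the updates overshoot and the loss grows instead of shrinking.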

Data Warehouse vs Data Lake vs Lake House vs Mesh

Repost from Programming, data science, ML - free courses by Big Data Specialist

Mastering AI Agents.pdf (1.57 MB)

▎Data Visualization: The Art of Turning Numbers into Stories

Imagine you’re at a party, and someone starts talking about how many people prefer pizza over tacos. They could throw out a bunch of numbers, and you might nod politely, but your eyes would probably glaze over. Now, picture them pulling out a vibrant pie chart that slices up the preferences in colorful segments. Suddenly, it’s not just numbers; it’s a story! You can see who loves pizza and who’s all about those tacos at a glance.

▎Why Data Visualization Rocks

1. Instant Understanding: Humans are visual creatures. A well-designed graph can convey complex information quickly and clearly. It’s like giving your audience a cheat sheet to the data.
2. Spotting Trends and Patterns: Ever tried to read a spreadsheet with thousands of rows? Yikes! But with a line graph, you can easily spot trends over time, like that steady rise in your friend's pizza sales during the summer. 🍕📈
3. Engagement: A captivating visual grabs attention and keeps people interested. Think of infographics or interactive dashboards: they’re like the cool kids of the data world, making everyone want to join the conversation.
4. Decision-Making: Good visuals help stakeholders make informed decisions. Instead of drowning in data, they can look at a bar chart comparing sales across regions and see where to focus their efforts.

▎Tools of the Trade

There are some pretty awesome tools out there to create stunning visuals:
• Tableau: This is like the Swiss Army knife of data visualization. It’s user-friendly and lets you create interactive dashboards without needing to code.
• Matplotlib and Seaborn (Python): If you’re into coding, these libraries let you craft beautiful graphs right from your Python scripts. Perfect for those who love to get hands-on with their data!
• D3.js: For web developers, D3.js is a JavaScript library that brings data to life using HTML, SVG, and CSS. You can create anything from simple charts to complex interactive graphics.

▎A Quick Example

Let’s say you want to visualize your weekly coffee consumption (because who doesn’t love coffee?). Instead of just listing out numbers, you could create a bar chart showing how many cups you drink each day:

import matplotlib.pyplot as plt

# Days of the week
days = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
# Coffee cups consumed
cups = [2, 3, 4, 1, 5, 6, 3]

plt.bar(days, cups, color='brown')
plt.title('Weekly Coffee Consumption')
plt.xlabel('Days')
plt.ylabel('Cups of Coffee')
plt.show()

With this simple code, you’ve transformed boring numbers into a visual that tells a story about your caffeine habits!

▎Conclusion

Data visualization isn’t just about making pretty pictures; it’s about making data accessible and understandable. It helps you tell stories that resonate with your audience and empowers them to make decisions based on insights rather than just raw numbers. So next time you have data to share, think about how you can visualize it. Your audience will thank you!
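
The "spotting trends" point from the post can be illustrated the same way with a line chart. This is a small sketch, assuming made-up monthly pizza sales figures (they are not real data) and saving the chart to a file named `pizza_sales_trend.png` instead of opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script also runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly pizza sales (illustrative numbers only)
months = ['Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep']
sales = [120, 135, 160, 190, 210, 170]

fig, ax = plt.subplots()
ax.plot(months, sales, marker='o', color='tomato')  # line chart with a marker per month
ax.set_title('Monthly Pizza Sales')
ax.set_xlabel('Month')
ax.set_ylabel('Pizzas Sold')
fig.savefig('pizza_sales_trend.png')  # write the chart to disk
```

The summer rise and the September dip are visible at a glance in the line, which is harder to see in a raw list of six numbers.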

Linear Algebra for Data Science.pdf (6.12 KB)

▎Machine Learning Basics

Machine Learning (ML) is a subset of artificial intelligence that enables systems to learn from data and improve their performance over time without being explicitly programmed.

▎Core Concepts

1. Types of Machine Learning:
– Supervised Learning: The model is trained on labeled data (input-output pairs). Common algorithms include:
▪️ Linear Regression
▪️ Decision Trees
▪️ Support Vector Machines (SVM)
– Unsupervised Learning: The model works with unlabeled data to find patterns or groupings. Common algorithms include:
▪️ K-Means Clustering
▪️ Hierarchical Clustering
▪️ Principal Component Analysis (PCA)
– Reinforcement Learning: The model learns by interacting with an environment and receiving feedback in the form of rewards or penalties.

2. Key Components:
– Features: Individual measurable properties or characteristics used as input for the model.
– Labels: The output variable that the model aims to predict (in supervised learning).
– Training Data: The dataset used to train the model.
– Test Data: A separate dataset used to evaluate the model's performance.

▎Machine Learning Workflow

1. Data Collection: Gather relevant data from various sources.
2. Data Preprocessing: Clean and prepare the data for analysis, including handling missing values and normalizing features.
3. Model Selection: Choose an appropriate algorithm based on the problem type.
4. Training: Fit the model to the training data.
5. Evaluation: Assess the model's performance using metrics suited to the task, such as accuracy, precision, recall, and F1-score for classification.
6. Hyperparameter Tuning: Optimize the model's hyperparameters to improve performance.
7. Deployment: Implement the model in a real-world application.

▎Example: Supervised Learning with Scikit-Learn

Here's a simple example using Python's scikit-learn library to perform linear regression:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Sample data
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 3, 5, 7, 11])

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Create and train the model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions
predictions = model.predict(X_test)

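# Plot results: draw the fitted line across all X values
# (the test split holds a single point, which would not form a visible line)
plt.scatter(X, y, color='blue', label='Data Points')
plt.plot(X, model.predict(X), color='red', label='Fitted Line')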
plt.legend()
plt.show()
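
To make the "Training" step less of a black box, here is a rough sketch of what fitting a one-feature linear regression amounts to: the closed-form least-squares slope and intercept. It reuses the sample data from the post but fits on all five points, so the numbers will differ from a model trained on the four-point training split.

```python
# Closed-form least squares for one feature: y ≈ slope * x + intercept
xs = [1, 2, 3, 4, 5]
ys = [2, 3, 5, 7, 11]

mean_x = sum(xs) / len(xs)
mean_y = sum(ys) / len(ys)

# slope = covariance(x, y) / variance(x); the 1/n factors cancel
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
var_x = sum((x - mean_x) ** 2 for x in xs)
slope = cov_xy / var_x
intercept = mean_y - slope * mean_x

print(f"slope={slope:.2f}, intercept={intercept:.2f}")  # slope=2.20, intercept=-1.00
```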

How LLMs Work: A Step by Step Explanation

▎Common Generative AI Terms

1. Generative AI: A type of artificial intelligence that can create new content, such as text, images, music, code, or videos, based on patterns learned from existing data.
2. Large Language Model (LLM): A deep learning model trained on massive amounts of text data, capable of understanding, generating, and manipulating human language. Examples: GPT-3.5, GPT-4, ChatGPT, Claude.
3. Tokens: The basic units of text that LLMs process. They can be words, sub-word units, or punctuation, and are used to break down input text.
4. Context Window: The maximum number of tokens an LLM can consider at once when processing input and generating output. A larger context window allows for longer conversations and more complex prompts.
5. Prompt: The input text or instructions given to a Generative AI model to elicit a specific response or output.
6. Prompt Engineering: The art and science of crafting effective prompts to guide Generative AI models to produce desired outputs, optimizing for accuracy, relevance, and style.
7. Zero-Shot Prompting: Asking an LLM to perform a task without providing any examples in the prompt, relying on its general knowledge and understanding of language.
8. Few-Shot Prompting: Providing an LLM with a few examples of input-output pairs within the prompt itself to demonstrate the desired task and improve performance.
9. Chain-of-Thought (CoT) Prompting: Encouraging an LLM to generate step-by-step reasoning before arriving at a final answer, improving performance on complex tasks.
10. Temperature: A parameter that controls the randomness of an LLM's output. Higher temperatures lead to more creative but potentially less coherent responses, while lower temperatures yield more focused and deterministic outputs.
11. Hallucination: When a Generative AI model produces incorrect, nonsensical, or fabricated information that is presented as factual.
12. Fine-tuning: The process of further training a pre-trained LLM on a smaller, specific dataset to adapt it to a particular task or domain.
13. Retrieval Augmented Generation (RAG): A technique that enhances LLMs by retrieving relevant information from an external knowledge base before generating a response, grounding the AI in factual data.
14. Embeddings: Numerical representations (vectors) of text, images, or other data that capture semantic meaning, allowing AI models to understand relationships between different pieces of information.
15. Latent Space: An abstract, multi-dimensional space where Generative AI models represent and manipulate data. The process of generating content involves navigating this space.
16. Diffusion Models: A class of generative models, popular for image generation, that work by gradually adding noise to data and then learning to reverse the process to create new data.
17. Generative Adversarial Network (GAN): A framework consisting of two neural networks (a generator and a discriminator) that compete against each other to produce highly realistic synthetic data.
18. Multimodal AI: Generative AI models capable of understanding and generating content across multiple modalities, such as text, images, audio, and video.
19. Transformer Architecture: The foundational neural network architecture that powers most modern LLMs, known for its ability to process sequential data and capture long-range dependencies.
20. Content Moderation: Processes and tools used to ensure that AI-generated content adheres to safety guidelines, ethical standards, and legal requirements, preventing the creation of harmful or inappropriate material.
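
The effect of temperature (term 10) is easier to see with numbers. Below is a minimal sketch of temperature-scaled softmax over next-token scores. The three logits are invented for illustration, and real LLM sampling typically adds further steps (for example top-k or top-p filtering) that are not shown here.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw scores to probabilities after dividing by the temperature."""
    scaled = [score / temperature for score in logits]
    max_scaled = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - max_scaled) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores for three candidate next tokens
logits = [2.0, 1.0, 0.1]

cold = softmax_with_temperature(logits, temperature=0.5)  # sharper distribution
hot = softmax_with_temperature(logits, temperature=2.0)   # flatter distribution

print([round(p, 3) for p in cold])
print([round(p, 3) for p in hot])
```

At the lower temperature the top-scoring token takes most of the probability mass, which is why outputs become more deterministic. At the higher temperature the less likely tokens get a larger share.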

Data Pipeline Overview

🎭 The Deceiving Score: Accuracy vs. Precision/Recall (Imbalanced Data)

💡 Your model to detect a rare disease (1% prevalence) boasts 99% accuracy. Impressive? Not if it just says "NO DISEASE" to everyone! For imbalanced data, plain accuracy is a lie.

📈 The Problem: Imbalanced Data
Many real-world cases (fraud, disease, ad clicks) have a tiny "positive" class. A model predicting the majority class (e.g., "no disease") will have high accuracy but be useless for finding the rare events you care about.

📊 Beyond Accuracy: The Confusion Matrix
Break down predictions into:
• True Positives (TP): Correctly found the positive.
• True Negatives (TN): Correctly found the negative.
• False Positives (FP): Wrongly said positive (costly "false alarms").
• False Negatives (FN): Wrongly said negative (costly "missed opportunities").

🎯 The Right Metrics
• Accuracy: (TP + TN) / Total. Avoid for imbalanced data!
• Precision: TP / (TP + FP)
  Meaning: Out of all times it said "Positive," how many were truly positive?
  Use When: False Positives (FP) are very costly (e.g., wrongly flagging a healthy person as sick).
• Recall: TP / (TP + FN)
  Meaning: Out of all actual positives, how many did it catch?
  Use When: False Negatives (FN) are very costly (e.g., missing a real fraud, not detecting a tumor).
• F1-Score: The harmonic mean of Precision and Recall, 2 × Precision × Recall / (Precision + Recall), which balances the two.

🐍 Code Example: The 99% Accurate Lie

from sklearn.metrics import accuracy_score, precision_score, recall_score
import numpy as np

y_true = np.concatenate([np.zeros(990), np.ones(10)]) # 1000 samples, 1% positive

# Model 1: Always predicts '0' (no disease)
y_pred_bad = np.zeros(1000) 
print(f"Model 1 (Always No Disease):\n  Accuracy: {accuracy_score(y_true, y_pred_bad):.2f}")
print(f"  Precision: {precision_score(y_true, y_pred_bad, zero_division=0):.2f}") # 0.00!
print(f"  Recall: {recall_score(y_true, y_pred_bad):.2f}\n") # 0.00!

# Model 2: Catches 5 positives, 2 false alarms (Better!)
y_pred_better = np.zeros(1000)
y_pred_better[990:995] = 1 # 5 True Positives
y_pred_better[100:102] = 1 # 2 False Positives
print(f"Model 2 (Actually Catches Some):\n  Accuracy: {accuracy_score(y_true, y_pred_better):.2f}")
print(f"  Precision: {precision_score(y_true, y_pred_better, zero_division=0):.2f}") # 0.71
print(f"  Recall: {recall_score(y_true, y_pred_better):.2f}") # 0.50
# Both models print 0.99 accuracy (0.990 vs 0.993), but Precision/Recall show Model 2 is far superior!

🎯 Today's Goal (What you should do)
✔️ Recognize accuracy's flaw for imbalanced data.
✔️ Pick Precision when False Positives hurt most.
✔️ Pick Recall when False Negatives hurt most.
✔️ Understand what your model's mistakes truly cost.
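
The post names the F1-score without computing it. As a small sketch, the helper below derives precision, recall, and F1 directly from confusion-matrix counts. The function name `precision_recall_f1` is made up for this example, and the counts are the ones described for Model 2 in the post (5 true positives, 2 false positives, 5 missed positives).

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Model 2 from the post: 5 true positives, 2 false positives, 5 false negatives
p, r, f1 = precision_recall_f1(tp=5, fp=2, fn=5)
print(f"Precision: {p:.2f}, Recall: {r:.2f}, F1: {f1:.2f}")  # Precision: 0.71, Recall: 0.50, F1: 0.59
```

For the always-negative Model 1 (`tp=0, fp=0, fn=10`) the same helper returns 0.0 for all three values, even though its accuracy is 99%.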

Repost from Programming Quiz Channel

Which deep learning model is commonly used for sequential text data?

Anonymous voting

Support Vector Machine Notes 🗒️ .pdf (8.57 MB)

▎How to Enter Data Science

1. Master the Fundamentals
Begin with the foundational skills by learning programming languages like Python and R, which are essential for data analysis and machine learning. Familiarize yourself with key libraries and tools such as Pandas and NumPy for data handling, scikit-learn and TensorFlow for machine learning, and Tableau and Matplotlib for data visualization. Online courses, tutorials, and coding bootcamps can provide structured learning paths.

2. Identify Your Niche
Data science spans various industries, including healthcare, finance, marketing, and technology. Explore these fields to determine where your interests lie. Understanding the specific challenges and data types in your chosen industry will help you tailor your learning and make you more effective in your future role.

3. Build a Strong Portfolio
Start working on small projects that demonstrate your skills and knowledge. These could include data analysis tasks, machine learning models, or visualizations based on publicly available datasets. Use platforms like GitHub to showcase your work, and consider writing blog posts or creating presentations to explain your projects. A well-rounded portfolio not only highlights your technical capabilities but also reflects your problem-solving approach.

4. Engage with the Community
Join data science communities online (like Kaggle, Stack Overflow, or LinkedIn groups) to connect with professionals in the field. Participating in discussions, attending webinars, and contributing to open-source projects can enhance your learning experience and expand your network.

5. Pursue Continuous Learning
Data science is an ever-evolving field, so staying updated with the latest trends, techniques, and tools is crucial. Follow relevant blogs, podcasts, and research papers. Consider pursuing advanced certifications or degrees to deepen your expertise.

6. Gain Practical Experience
Look for internships, volunteer opportunities, or part-time positions that allow you to apply your skills in real-world scenarios. Practical experience will not only reinforce your learning but also give you insights into the day-to-day responsibilities of a data scientist.

By following these steps, you can build a solid foundation in data science and position yourself for success in this dynamic and rewarding field.

50 Data Science Project Ideas

▎Common Machine Learning Terms

1. Algorithm: A set of rules or steps used to solve a problem or perform a task, particularly in the context of data processing and analysis.
2. Model: A mathematical representation of a real-world process, created by training an algorithm on data.
3. Training Data: The dataset used to train a machine learning model; in supervised learning it consists of input-output pairs.
4. Test Data: A separate dataset used to evaluate the performance of a trained model, checking that it generalizes well to unseen data.
5. Overfitting: A modeling error that occurs when a model learns the training data too well, capturing noise along with the underlying pattern, leading to poor performance on new data.
6. Underfitting: A situation where a model is too simple to capture the underlying trend in the data, resulting in poor performance on both training and test datasets.
7. Feature: An individual measurable property or characteristic of the data used as input for a model.
8. Label: The output or target variable that a model aims to predict based on the input features.
9. Supervised Learning: A type of machine learning where the model is trained on labeled data, learning to map inputs to outputs.
10. Unsupervised Learning: A type of machine learning where the model is trained on unlabeled data, aiming to find patterns or groupings within the data.
11. Reinforcement Learning: A type of machine learning where an agent learns to make decisions by taking actions in an environment to maximize cumulative reward.
12. Hyperparameters: Configuration settings used to control the training process of a model, which are set before training begins.
13. Loss Function: A mathematical function that quantifies how well a model's predictions match the actual outcomes; used to guide the optimization process.
14. Gradient Descent: An optimization algorithm used to minimize the loss function by iteratively adjusting model parameters in the direction of the steepest descent.
15. Cross-Validation: A technique for assessing how the results of a model will generalize by dividing the dataset into multiple subsets and training/testing across them.
16. Confusion Matrix: A table used to evaluate the performance of a classification model by comparing predicted labels against actual labels.
17. Precision and Recall: Metrics used to evaluate classification models; precision measures the accuracy of positive predictions, while recall measures the ability to find all relevant instances.
18. ROC Curve (Receiver Operating Characteristic Curve): A graphical representation of a model's diagnostic ability across various threshold settings, plotting true positive rates against false positive rates.
19. Regularization: Techniques used to prevent overfitting by adding a penalty for complexity to the loss function (e.g., L1 and L2 regularization).
20. Ensemble Learning: Combining multiple models to improve overall performance; common methods include bagging, boosting, and stacking.
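
Cross-validation (term 15) is mostly bookkeeping over indices, which a short sketch can make concrete. The helper `k_fold_indices` below is a hypothetical name for this example; it splits samples into contiguous folds without shuffling, whereas library implementations usually offer shuffling and stratification as well.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread the remainder so fold sizes differ by at most one
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size

folds = list(k_fold_indices(n_samples=10, k=3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Each sample lands in the test set exactly once, so averaging the k test scores uses every data point for evaluation while never testing a model on data it was trained on.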

Repost from Programming Quiz Channel

Which ML technique is used to improve model performance by combining multiple models?

Anonymous voting

Repost from Data science research papers

TradingAgents: Multi-Agents LLM Financial Trading Framework

📅 Publication Date: Dec 28, 2024
📑 Paper: https://arxiv.org/pdf/2412.20138
🔗 Code: https://github.com/tauricresearch/tradingagents

🚀 Spaces citing this paper:
• https://huggingface.co/spaces/shanghengdu/LLM-Agent-Optimization-PaperList
• https://huggingface.co/spaces/tahp0604/ai-stock-watchlist

📝 Description: The paper introduces TradingAgents, a multi-agent framework that utilizes large language models for stock trading, simulating the collaborative dynamics of real-world trading firms. The framework consists of various agents, including fundamental analysts, sentiment analysts, technical analysts, and traders with different risk profiles, all powered by large language models. These agents work together to assess market conditions, manage risk, and make informed trading decisions. The framework also includes researcher agents that evaluate market conditions and a risk management team that monitors exposure.

What is Agentic AI?

🧠 The Statistical Illusion: Simpson’s Paradox 🎭

Imagine you are choosing a hospital for a surgery.
• Hospital A has a higher survival rate than Hospital B for "Easy" cases.
• Hospital A also has a higher survival rate for "Hard" cases.

Common sense says: Choose Hospital A. But when you look at the total combined data, Hospital B actually has a higher survival rate. 🤯

This is Simpson’s Paradox: A trend appears in several different groups but disappears or reverses when these groups are combined.

🔍 Why does this happen?
It happens because of a Lurking Variable (a hidden factor). In this case, Hospital A is a world-class facility, so it takes on way more "Hard" cases than Hospital B. Even though they are better at both types, the high volume of risky surgeries drags their overall average down.

🐍 See the Paradox in Code
Let's simulate this "impossible" scenario using Python:

import pandas as pd

# Data: [Successes, Total Attempts]
data = {
    'Hospital': ['A', 'A', 'B', 'B'],
    'Case_Type': ['Easy', 'Hard', 'Easy', 'Hard'],
    'Survived': [95, 100, 900, 7],
    'Total': [100, 1000, 1000, 100]  # A takes mostly Hard cases, B mostly Easy
}
df = pd.DataFrame(data)

# 1. Check rates per group
df['Rate'] = df['Survived'] / df['Total']
print("--- Rates by Group ---")
print(df[['Hospital', 'Case_Type', 'Rate']])

# 2. Check overall rates (sum only the numeric count columns)
overall = df.groupby('Hospital')[['Survived', 'Total']].sum()
overall['Overall_Rate'] = overall['Survived'] / overall['Total']
print("\n--- Overall Rates (The Paradox!) ---")
print(overall['Overall_Rate'])

The Result:
• A is better at Easy (95% vs 90%).
• A is better at Hard (10% vs 7%).
• BUT... Overall, B wins (82% vs 18%) because B mostly did "Easy" cases.

🛠 How to avoid being fooled?
1. Don't trust the aggregate: When analyzing data, always try to "segment" or "drill down" into sub-groups.
2. Look for the Weight: Ask yourself: "Is one group disproportionately represented in the total?"
3. Identify the Lurking Variable: What context is missing? (e.g., Age, Severity, Time of Day).

🎯 The Takeaway
In Data Science, the "Big Picture" can sometimes be a big lie. If your analysis produces a result that defies logic, you might be looking at a Simpson’s Paradox. Always slice your data before you trust it.
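
The "Look for the Weight" advice can be made explicit: a hospital's overall rate is a weighted average of its per-group rates, weighted by its case mix. The sketch below uses illustrative counts assumed for this example (Hospital A handling mostly Hard cases, Hospital B mostly Easy ones) and prints each group's weight next to its rate.

```python
# Illustrative counts (assumed): (survived, total) per hospital and case type
cases = {
    'A': {'Easy': (95, 100), 'Hard': (100, 1000)},
    'B': {'Easy': (900, 1000), 'Hard': (7, 100)},
}

overall_rates = {}
for hospital, groups in cases.items():
    total_cases = sum(total for _, total in groups.values())
    overall = 0.0
    for case_type, (survived, total) in groups.items():
        weight = total / total_cases  # share of this hospital's caseload
        rate = survived / total       # survival rate within the group
        overall += weight * rate      # weighted contribution to the overall rate
        print(f"{hospital} {case_type}: weight={weight:.2f}, rate={rate:.2f}")
    overall_rates[hospital] = overall
    print(f"{hospital} overall: {overall:.3f}")
```

With these counts, about 91% of A's weight sits on the low-survival Hard group, while about 91% of B's weight sits on the high-survival Easy group. That difference in weights, not in skill, produces the reversal.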

Hey everyone 👋

I know I promised to create a Data Science course. I was working on it late last year, but since early 2026 I’ve had some health issues, so it got postponed. I’ll get back to it as soon as I’m better 🙌

In the meantime, I launched this ☝️ today: https://learndevs.com/

I started building this back in the 2020s, together with many of you. It’s not perfect yet, but better to have it now than wait forever. Would love your feedback ❤️