Data Science & Machine Learning
Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free. For collaborations: @love_data
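📈 Analytics overview of the Telegram channel Data Science & Machine Learning
The channel Data Science & Machine Learning (@datasciencefun) is an active participant in the English-language segment. The community currently has 75,660 subscribers, ranks 2,114th in the Education category, and ranks 4,359th in the India region.
📊 Audience metrics and growth dynamics
Since its creation (date unknown), the project has kept growing quickly and has attracted 75,660 subscribers.
According to the latest data from 11 June 2026, the channel is operating steadily. The subscriber count changed by 911 over the past 30 days and by 29 over the past 24 hours, and overall reach remains substantial.
- Verification status: not verified
- Engagement rate (ER): the average audience engagement rate is 3.63%. Within 24 hours of publication, a post typically draws a response equal to 1.36% of the subscriber base.
- Post reach: each post averages 2,747 views, of which about 1,032 accumulate on the first day.
- Interaction and feedback: the audience participates actively, with an average of 5 reactions per post.
- Topic focus: content centers on core topics such as learning, accuracy, distribution, panda, dataset.
📝 Description and content strategy
The author positions the channel as a platform for expressing subjective views: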
“Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free
For collaborations: @love_data”
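With frequent updates (latest data collected on 12 June 2026), the channel stays fresh and keeps its reach high. The analysis shows an actively engaged audience, which makes the channel a key point of influence in the Education category.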
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2. Load data
df = pd.read_csv("house_prices.csv")
df.head()
Step 3. Basic checks
df.shape
df.info()
df.isnull().sum()
Step 4. Data cleaning
Fill missing values.
df.fillna(df.median(numeric_only=True), inplace=True)
Step 5. Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
Step 6. Feature scaling
scaler = StandardScaler()
X = df.drop('price', axis=1)
y = df['price']
X_scaled = scaler.fit_transform(X)
Step 7. Train test split
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.3, random_state=42
)
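One caveat about Steps 6 and 7: the scaler is fitted on all rows before the split, so statistics from the test rows leak into training. A minimal sketch of one common alternative, assuming scikit-learn's Pipeline and a small synthetic dataset (the columns area and rooms and all numbers are made up for illustration, not taken from house_prices.csv):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data (hypothetical columns).
rng = np.random.default_rng(42)
demo = pd.DataFrame({
    "area": rng.uniform(50, 250, size=200),
    "rooms": rng.integers(1, 6, size=200),
})
demo["price"] = 1000 * demo["area"] + 5000 * demo["rooms"] + rng.normal(0, 2000, size=200)

X_demo = demo.drop("price", axis=1)
y_demo = demo["price"]

# Split first, so the test rows stay unseen while fitting.
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits StandardScaler on the training rows only,
# then reuses those statistics when scoring the test rows.
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(Xd_train, yd_train)
print("R2 on held-out rows:", pipe.score(Xd_test, yd_test))
```

With a plain linear regression the effect on the score is usually small, but the habit matters more once cross-validation or regularized models enter the picture.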
Step 8. Build model
Linear Regression.
model = LinearRegression()
model.fit(X_train, y_train)
Step 9. Predictions
y_pred = model.predict(X_test)
Step 10. Evaluation
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
print("R2:", r2)
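To see what the three metrics measure, here is a small worked example that computes them by hand with NumPy. The actual and predicted values are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted prices, for illustration only.
actual = np.array([200.0, 250.0, 300.0, 350.0])
predicted = np.array([210.0, 240.0, 310.0, 330.0])

errors = actual - predicted                      # [-10, 10, -10, 20]
mae_manual = np.mean(np.abs(errors))             # average absolute error
rmse_manual = np.sqrt(np.mean(errors ** 2))      # squares punish large errors more
ss_res = np.sum(errors ** 2)                     # variation the model did not explain
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total variation in the target
r2_manual = 1 - ss_res / ss_tot

print("MAE:", mae_manual)             # 12.5
print("RMSE:", round(rmse_manual, 3)) # 13.229
print("R2:", round(r2_manual, 3))     # 0.944
```

RMSE is larger than MAE here because the single error of 20 is weighted more heavily after squaring.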
Typical results
• R2 between 0.70 and 0.85
• Location and area dominate price
Step 11. Feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)
importance
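Interpretation: A positive coefficient raises the predicted price; a negative one lowers it. Because the features were standardized, each coefficient is the price change per one standard deviation of that feature.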
Step 12. Model improvements
• Ridge regression for multicollinearity
• Lasso for feature selection
• Random Forest for non-linear patterns
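A rough sketch of the first two ideas on synthetic data, assuming scikit-learn's Ridge and Lasso with arbitrarily chosen alpha values. The three features are made up: x2 is almost a copy of x1 (multicollinearity) and x3 is unrelated to the target.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.01, size=n)  # nearly identical to x1
x3 = rng.normal(size=n)                   # noise feature, unrelated to y
X_syn = np.column_stack([x1, x2, x3])
y_syn = 3 * x1 + rng.normal(scale=0.5, size=n)

models = {
    "OLS": LinearRegression(),
    "Ridge": Ridge(alpha=1.0),   # alpha picked for illustration
    "Lasso": Lasso(alpha=0.1),   # alpha picked for illustration
}
for name, est in models.items():
    est.fit(X_syn, y_syn)
    print(name, np.round(est.coef_, 2))
```

On data like this, plain least squares tends to split the effect between x1 and x2 in an unstable way, Ridge tends to share it almost evenly, and Lasso tends to push the weight of the noise feature to zero or close to it.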
Resume bullet example
• Built house price prediction model using regression
• Achieved R2 score above 0.8
• Identified key price drivers
Interview explanation flow
• Why RMSE matters
• How multicollinearity affects coefficients
• Why tree models outperform linear sometimes
Mini task for you
• Try Ridge and Lasso
• Compare RMSE
• Plot actual vs predicted
Double Tap ♥️ For More
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
nltk.download('stopwords')
Step 2. Load data
df = pd.read_csv("sentiment.csv")
df.head()
Example review: "The movie was amazing" sentiment: 1
Step 3. Basic checks
df.shape
df['sentiment'].value_counts()
Step 4. Text cleaning
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    words = text.split()
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    return ' '.join(words)
df['clean_review'] = df['review'].apply(clean_text)
Step 5. Train test split
X = df['clean_review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
Step 6. Text vectorization TF IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
Why TF IDF
• Reduces common word weight
• Keeps meaningful words
Step 7. Model building
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
Step 8. Predictions
y_pred = model.predict(X_test_tfidf)
Step 9. Evaluation
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
Typical results
• Accuracy 85 to 90 percent
• Precision strong on positive reviews
• Neutral text harder to classify
Step 10. Test on custom text
sample = ["The product quality is terrible"]
sample_clean = [clean_text(sample[0])]
sample_vec = tfidf.transform(sample_clean)
model.predict(sample_vec)
Output: 0 (negative)
Common interview questions
• Why TF IDF over CountVectorizer
• How stopwords affect meaning
• Why Logistic Regression works well
Improvements
• Use n grams
• Try Naive Bayes
• Use LSTM or Transformers
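A minimal sketch of the first two ideas, assuming scikit-learn's ngram_range parameter and MultinomialNB. The six texts and labels are made up, and stopword removal is skipped on purpose so that "not" survives.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Made-up reviews: 1 = positive, 0 = negative.
texts = [
    "good movie", "not good movie",
    "great acting", "not great acting",
    "good acting", "not good acting",
]
labels = [1, 0, 1, 0, 1, 0]

# ngram_range=(1, 2) keeps single words and adds two-word phrases.
nb_model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    MultinomialNB(),
)
nb_model.fit(texts, labels)

vocab = nb_model.named_steps["tfidfvectorizer"].vocabulary_
print("'not good' is a feature:", "not good" in vocab)
print("Prediction:", nb_model.predict(["not good movie"]))
```

With unigrams only, "not good" and "good" share the feature "good"; bigrams give the negated phrase a feature of its own.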
Resume bullet example
• Built sentiment analysis model using TF IDF and Logistic Regression
• Achieved 88 percent accuracy on review data
• Automated text preprocessing pipeline
Mini task for you
• Add bigrams
• Compare Naive Bayes
• Plot ROC curve
Double Tap ♥️ For More
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
Step 2. Load data
df = pd.read_csv("ratings.csv")
df.head()
Example data
user_id | item_id | rating
1 | 101 | 5
1 | 102 | 3
Step 3. Create user item matrix
user_item_matrix = df.pivot_table(
    index='user_id',
    columns='item_id',
    values='rating'
)
Matrix shape
• Rows: users
• Columns: items
• Values: ratings
Step 4. Handle missing values
user_item_matrix.fillna(0, inplace=True)
Why? Cosine similarity cannot be computed with missing values. Note that filling with 0 treats "not rated" the same as a rating of 0.
Step 5. Compute user similarity
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(
    user_similarity,
    index=user_item_matrix.index,
    columns=user_item_matrix.index
)
Step 6. Find similar users
user_id = 1
similar_users = user_similarity_df[user_id].sort_values(ascending=False)
similar_users.head()
Top result: the user itself, with score 1. Ignore it.
Step 7. Recommend items
Get items rated by similar users
similar_users = similar_users[similar_users.index != user_id]
weighted_ratings = user_item_matrix.loc[similar_users.index].T.dot(similar_users)
recommendations = weighted_ratings.sort_values(ascending=False)
Remove already rated items.
already_rated = user_item_matrix.loc[user_id]
already_rated = already_rated[already_rated > 0].index
recommendations = recommendations.drop(already_rated)
recommendations.head(5)
Output: top 5 recommended item IDs.
Step 8. Why cosine similarity
• Focuses on rating pattern
• Ignores scale differences
• Fast and simple
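The "ignores scale differences" point can be checked with a few lines of NumPy. The three rating vectors below are made up for illustration.

```python
import numpy as np

def cosine(a, b):
    """Cosine of the angle between two rating vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

harsh = np.array([1.0, 2.0, 1.0])     # rates everything low
generous = np.array([2.0, 4.0, 2.0])  # same pattern, doubled scale
opposite = np.array([4.0, 1.0, 4.0])  # prefers the other items

print(round(cosine(harsh, generous), 3))  # 1.0
print(round(cosine(harsh, opposite), 3))  # 0.711
```

The harsh and the generous rater count as identical because only the direction of the vector matters, not its length.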
Limitations
• Cold start problem
• Sparse matrix
• No item features
Improvements
• Item based filtering
• Matrix factorization
• Hybrid models
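As a rough sketch of the first idea, item based filtering can reuse the same building blocks by transposing the matrix, so that similarity is computed between items instead of users. The ratings below are made up and only mimic the layout of ratings.csv.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Made-up ratings in the same layout as ratings.csv.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 2, 3, 3, 3],
    "item_id": [101, 102, 101, 102, 101, 102, 103],
    "rating":  [5, 3, 4, 3, 5, 2, 4],
})
matrix = ratings.pivot_table(
    index="user_id", columns="item_id", values="rating"
).fillna(0)

# Transpose so that rows are items; similarity is then item-to-item.
item_similarity = pd.DataFrame(
    cosine_similarity(matrix.T),
    index=matrix.columns,
    columns=matrix.columns,
)

# Items most similar to item 101, excluding item 101 itself.
print(item_similarity[101].drop(101).sort_values(ascending=False))
```

Item similarities usually change more slowly than user similarities, so they can often be precomputed and reused.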
Resume bullet example
• Built recommendation system using collaborative filtering
• Used cosine similarity on user item matrix
• Generated personalized item recommendations
Interview explanation flow
• Difference between content based and collaborative
• Why sparsity hurts
• Cold start solutions
Mini task for you
• Convert to item based filtering
• Add minimum similarity threshold
• Evaluate using precision at K
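For the last task, one common definition of precision at K is the share of the top K recommended items that the user actually found relevant. A minimal sketch with made-up item IDs (the helper name precision_at_k is hypothetical):

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are relevant."""
    top_k = recommended[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

recommended_items = [103, 107, 101, 110, 105]  # ranked, best first
relevant_items = {101, 103, 104}               # e.g. held-out items the user liked

p_at_3 = precision_at_k(recommended_items, relevant_items, k=3)
p_at_5 = precision_at_k(recommended_items, relevant_items, k=5)
print(round(p_at_3, 3))  # 0.667 (2 of the top 3 are relevant)
print(p_at_5)            # 0.4   (2 of the top 5 are relevant)
```

In practice the relevant set comes from ratings that were hidden from the model during training.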
Double Tap ♥️ For More
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error
Step 2. Load Data
df = pd.read_csv("sales.csv")
df.head()
Step 3. Date Handling
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Sort by date
df = df.sort_index()
Step 4. Visualize Sales Trend
plt.plot(df.index, df['Sales'])
plt.title("Sales over time")
plt.show()
What you observe:
- Trend
- Seasonality
- Sudden spikes
Step 5. Decompose Time Series
decomposition = seasonal_decompose(df['Sales'], model='additive')
decomposition.plot()
plt.show()
Insight
- Trend shows long-term growth
- Seasonality repeats yearly or monthly
Step 6. Train Test Split
Split by time.
train = df.iloc[:-12]
test = df.iloc[-12:]
Why? The last 12 months simulate the future.
Step 7. Build ARIMA Model
model = ARIMA(train['Sales'], order=(1,1,1))
model_fit = model.fit()
Order meaning
- p: autoregressive
- d: differencing
- q: moving average
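The d term is the easiest to see in isolation. With d=1 the model works on the first difference of the series, which removes a linear trend. A small NumPy illustration on a made-up series:

```python
import numpy as np

t = np.arange(10)
series = 100 + 5 * t       # straight upward trend, no noise

# First difference: y[t] - y[t-1]
diff1 = np.diff(series)
print(series)  # 100, 105, ..., 145
print(diff1)   # nine values, all equal to 5
```

The differenced series is flat, which is closer to the stationary input that the autoregressive and moving average parts assume.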
Step 8. Forecast
forecast = model_fit.forecast(steps=12)
print(forecast)
Step 9. Plot Forecast vs Actual
plt.plot(train.index, train['Sales'], label='Train')
plt.plot(test.index, test['Sales'], label='Actual')
plt.plot(test.index, forecast, label='Forecast')
plt.legend()
plt.show()
Step 10. Evaluation
mae = mean_absolute_error(test['Sales'], forecast)
rmse = np.sqrt(mean_squared_error(test['Sales'], forecast))
print("MAE:", mae)
print("RMSE:", rmse)
Typical results:
- RMSE depends on scale
- Trend captured well
- Peaks harder to predict
Step 11. Business Interpretation
- Underforecast leads to stockouts
- Overforecast leads to inventory waste
- Accuracy matters near peaks
Model Improvement Ideas
- SARIMA for seasonality
- Prophet for business calendars
- Add promotions and holidays
Resume Bullet Example
- Built time series model to forecast monthly sales
- Used ARIMA with a time-based holdout split
- Reduced forecasting error using trend analysis
Interview Explanation Flow
- Why random split fails
- Importance of seasonality
- Error metrics selection
Mini Task for You
- Try SARIMA
- Forecast next 24 months
- Compare RMSE across models
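When comparing RMSE across models, it can help to include simple baselines, so that an ARIMA or SARIMA score has something to beat. The sketch below is not SARIMA; it compares a naive forecast with a seasonal naive forecast on a made-up monthly series with trend and yearly seasonality.

```python
import numpy as np

# Synthetic monthly sales: trend + yearly seasonality (made-up numbers).
months = np.arange(48)
sales = 200 + 2 * months + 30 * np.sin(2 * np.pi * months / 12)

# Same time-based split as above: hold out the last 12 months.
train_s, test_s = sales[:-12], sales[-12:]

# Naive: repeat the last observed value for every future month.
naive_fc = np.full(12, train_s[-1])
# Seasonal naive: repeat the same month from the last observed year.
seasonal_fc = train_s[-12:]

def rmse_of(actual, pred):
    return float(np.sqrt(np.mean((actual - pred) ** 2)))

naive_rmse = rmse_of(test_s, naive_fc)
seasonal_rmse = rmse_of(test_s, seasonal_fc)
print("Naive RMSE:", round(naive_rmse, 2))
print("Seasonal naive RMSE:", round(seasonal_rmse, 2))
```

On this series the seasonal baseline misses only the one year of trend growth, so it scores better than the plain naive forecast. A model that cannot beat it is not yet capturing the seasonality.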
Double Tap ♥️ For More
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
Step 2. Load data
df = pd.read_csv("creditcard.csv")
df.head()
Step 3. Basic checks
df.shape
df['Class'].value_counts()
Output example:
• Genuine 284315
• Fraud 492
Step 4. Data understanding
Check class imbalance:
sns.countplot(x='Class', data=df)
plt.show()
Insight: highly imbalanced dataset.
Step 5. Feature scaling
Scale Amount column:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df[['Amount']])
Drop Time.
df.drop('Time', axis=1, inplace=True)
Step 6. Split features and target
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
Step 7. Baseline model
Logistic Regression with class weight:
model = LogisticRegression(
    max_iter=1000, class_weight='balanced'
)
model.fit(X_train, y_train)
Why class_weight
• Penalizes fraud mistakes more
• Improves recall
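To make "penalizes fraud mistakes more" concrete, scikit-learn documents the 'balanced' setting as n_samples / (n_classes * class_count). The sketch below applies that formula to the full-dataset counts from Step 3; the model itself sees only the training split, but the stratified split keeps nearly the same ratio.

```python
import numpy as np

counts = np.array([284315, 492])  # genuine, fraud (from the Step 3 output)
n_samples = counts.sum()
n_classes = len(counts)

# 'balanced' heuristic: n_samples / (n_classes * class_count)
weights = n_samples / (n_classes * counts)
print("genuine weight:", round(weights[0], 3))  # about 0.501
print("fraud weight:", round(weights[1], 3))    # about 289.438
```

A misclassified fraud case therefore costs several hundred times more in the loss than a misclassified genuine one.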
Step 8. Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
Step 9. Evaluation
Confusion matrix:
confusion_matrix(y_test, y_pred)
Classification report:
print(classification_report(y_test, y_pred))
ROC AUC:
roc_auc_score(y_test, y_prob)
Typical results
• Accuracy looks high but is misleading on imbalanced data, so ignore it
• Fraud recall improves sharply
• ROC AUC around 0.97
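A quick piece of arithmetic shows why accuracy is set aside here. Using the class counts from Step 3, a "model" that labels every transaction as genuine already looks excellent on accuracy:

```python
genuine, fraud = 284315, 492  # class counts from the Step 3 output

# Predict "genuine" for everything: all genuine rows are right, all fraud rows are wrong.
baseline_accuracy = genuine / (genuine + fraud)
baseline_fraud_recall = 0 / fraud  # no fraud case is caught

print("Accuracy:", round(baseline_accuracy, 4))  # 0.9983
print("Fraud recall:", baseline_fraud_recall)    # 0.0
```

That is 99.83 percent accuracy with zero fraud detected, which is why recall and ROC AUC carry the evaluation.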
Step 10. Threshold tuning
Increase fraud recall:
y_pred_custom = (y_prob > 0.3).astype(int)
confusion_matrix(y_test, y_pred_custom)
Business logic
Lower threshold catches more fraud. More false alerts accepted.
Step 11. Advanced approach
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
    n_estimators=100, class_weight='balanced', random_state=42
)
rf.fit(X_train, y_train)
rf_prob = rf.predict_proba(X_test)[:,1]
roc_auc_score(y_test, rf_prob)
Resume bullet example
- Built fraud detection model on highly imbalanced data
- Improved fraud recall using class weighting and threshold tuning
- Evaluated model using ROC AUC instead of accuracy
Interview explanation flow
- Explain imbalance problem
- Why accuracy fails
- Why recall matters
- How threshold changes business impact
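For the last point, a toy example can make the trade-off tangible. The labels and predicted probabilities below are made up; the helper counts how many fraud cases are caught and how many genuine transactions are flagged at a given threshold.

```python
import numpy as np

# Made-up labels (1 = fraud) and predicted fraud probabilities.
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.35, 0.40, 0.60, 0.25, 0.45, 0.70, 0.90])

def recall_and_false_alerts(threshold):
    """Return (fraud recall, number of genuine rows flagged) at a threshold."""
    pred = (y_score > threshold).astype(int)
    true_pos = int(np.sum((pred == 1) & (y_true == 1)))
    false_pos = int(np.sum((pred == 1) & (y_true == 0)))
    recall = true_pos / int(np.sum(y_true == 1))
    return recall, false_pos

for th in (0.5, 0.3):
    print(th, recall_and_false_alerts(th))
# 0.5 -> (0.5, 1)
# 0.3 -> (0.75, 3)
```

Lowering the threshold from 0.5 to 0.3 catches one more fraud case at the price of two more false alerts, so the right value depends on what a missed fraud costs compared with a manual review.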
Mini task for you
- Apply SMOTE
- Compare with Isolation Forest
- Plot Precision Recall curve
Double Tap ♥️ For More
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
Step 2. Load data
df = pd.read_csv("customer_churn.csv")
df.head()
Step 3. Basic checks
df.shape
df.info()
df.isnull().sum()
Step 4. Data cleaning
Convert TotalCharges to numeric.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
Drop customer ID.
df.drop('customerID', axis=1, inplace=True)
Step 5. Exploratory Data Analysis
Churn distribution.
sns.countplot(x='Churn', data=df)
plt.show()
Tenure vs churn.
sns.boxplot(x='Churn', y='tenure', data=df)
plt.show()
Common insights:
• Month-to-month contracts churn more
• Low tenure users churn early
• High monthly charges increase churn
Step 6. Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
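One thing to keep in mind with label encoding: it turns categories into 0, 1, 2, and a linear model such as Logistic Regression reads those numbers as an order. A possible alternative for columns with more than two categories is one-hot encoding. The sketch below uses pandas get_dummies on a made-up Contract column, which is assumed here to look like the contract types mentioned in the insights above.

```python
import pandas as pd

# Made-up sample of a categorical column with three values.
demo = pd.DataFrame({
    "Contract": ["Month-to-month", "One year", "Two year", "One year"]
})

# One 0/1 column per category; drop_first avoids a redundant column.
encoded = pd.get_dummies(demo, columns=["Contract"], drop_first=True, dtype=int)
print(encoded)
```

The dropped first category (Month-to-month) becomes the reference level, so each remaining coefficient is read relative to it.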
Step 7. Feature scaling
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[num_cols] = scaler.fit_transform(df[num_cols])
Step 8. Split data
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y
)
Step 9. Build model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Step 10. Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
Step 11. Evaluation
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
roc_auc_score(y_test, y_prob)
Typical results:
• Accuracy around 78 to 83 percent
• ROC AUC around 0.84
• Recall for churn is key metric
Step 12. Business actions from model
• Target high-risk users
• Offer discounts to month-to-month users
• Push yearly contracts
• Improve onboarding for first 90 days
Resume bullet example:
• Built churn prediction model using Logistic Regression
• Identified contract type and tenure as top churn drivers
• Used a stratified split to keep the churn rate consistent across train and test
Interview explanation flow:
• Revenue loss problem
• Why recall matters more than accuracy
• How features map to actions
Mini task for you:
• Train Random Forest
• Compare ROC AUC
• Tune threshold for higher recall
Double Tap ♥️ For Part-3
