Data Science & Machine Learning
Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free For collaborations: @love_data
Показати більше📈 Аналітичний огляд Telegram-каналу Data Science & Machine Learning
Канал Data Science & Machine Learning (@datasciencefun) у мовному сегменті Англійська є активним учасником. На даний момент спільнота об'єднує 75 660 підписників, посідаючи 2 114 місце в категорії Освіта та 4 359 місце у регіоні Індія.
📊 Показники аудиторії та динаміка
З моменту свого створення невідомо, проект продемонстрував стрімке зростання, зібравши аудиторію у 75 660 підписників.
За останніми даними від 11 червня, 2026, канал демонструє стабільну активність. Хоча за останні 30 днів спостерігається зміна кількості учасників на 911, а за останні 24 години на 29, загальне охоплення залишається високим.
- Статус верифікації: Не верифікований
- Рівень залученості (ER): Середній показник залученості аудиторії становить 3.63%. Протягом перших 24 годин після публікації контент зазвичай збирає 1.36% реакцій від загальної кількості підписників.
- Охоплення публікацій: В середньому кожен допис отримує 2 747 переглядів. Протягом першої доби публікація в середньому набирає 1 032 переглядів.
- Реакції та взаємодія: Аудиторія активно підтримує контент: середня кількість реакцій на один пост – 5.
- Тематичні інтереси: Контент зосереджений навколо ключових тем, таких як learning, accuracy, distribution, panda, dataset.
📝 Опис та контентна політика
Автор описує ресурс як майданчик для висловлення суб'єктивної думки:
“Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free
For collaborations: @love_data”
Завдяки високій частоті оновлень (останні дані отримано 12 червня, 2026), канал підтримує актуальність та високий рівень охоплення публікацій. Аналітика показує, що аудиторія активно взаємодіє з контентом, що робить його важливою точкою впливу в категорії Освіта.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2. Load data
df = pd.read_csv("house_prices.csv")
df.head()
Step 3. Basic checks
df.shape
df.info()
df.isnull().sum()
Step 4. Data cleaning
Fill missing values.
df.fillna(df.median(numeric_only=True), inplace=True)
Step 5. Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
df[col] = le.fit_transform(df[col])
Step 6. Feature scaling
scaler = StandardScaler()
X = df.drop('price', axis=1)
y = df['price']
X_scaled = scaler.fit_transform(X)
Step 7. Train test split
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42
)
Step 8. Build model
Linear Regression.
model = LinearRegression()
model.fit(X_train, y_train)
Step 9. Predictions
y_pred = model.predict(X_test)
Step 10. Evaluation
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
print("R2:", r2)
Typical results
• R2 between 0.70 to 0.85
• Location and area dominate price
Step 11. Feature importance
importance = pd.DataFrame({
'Feature': X.columns,
'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)
importance
Interpretation: Positive coefficient increases price. Negative reduces price.
Step 12. Model improvements
• Ridge regression for multicollinearity
• Lasso for feature selection
• Random Forest for non-linear patterns
Resume bullet example
• Built house price prediction model using regression
• Achieved R2 score above 0.8
• Identified key price drivers
Interview explanation flow
• Why RMSE matters
• How multicollinearity affects coefficients
• Why tree models outperform linear sometimes
Mini task for you
• Try Ridge and Lasso
• Compare RMSE
• Plot actual vs predicted
Double Tap ♥️ For Moreimport pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
nltk.download('stopwords')
Step 2. Load data
df = pd.read_csv("sentiment.csv")
df.head()
Example review: "The movie was amazing" sentiment: 1
Step 3. Basic checks
df.shape
df['sentiment'].value_counts()
Step 4. Text cleaning
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def clean_text(text):
text = text.lower()
text = re.sub('[^a-z]', ' ', text)
words = text.split()
words = [stemmer.stem(w) for w in words if w not in stop_words]
return ' '.join(words)
df['clean_review'] = df['review'].apply(clean_text)
Step 5. Train test split
X = df['clean_review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Step 6. Text vectorization TF IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
Why TF IDF
• Reduces common word weight
• Keeps meaningful words
Step 7. Model building
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
Step 8. Predictions
y_pred = model.predict(X_test_tfidf)
Step 9. Evaluation
accuracy_score(y_test, y_pred)
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
Typical results
• Accuracy 85 to 90 percent
• Precision strong on positive reviews
• Neutral text harder to classify
Step 10. Test on custom text
sample = ["The product quality is terrible"]
sample_clean = [clean_text(sample[0])]
sample_vec = tfidf.transform(sample_clean)
model.predict(sample_vec)
Output: 0 negative
Common interview questions
• Why TF IDF over CountVectorizer
• How stopwords affect meaning
• Why Logistic Regression works well
Improvements
• Use n grams
• Try Naive Bayes
• Use LSTM or Transformers
Resume bullet example
• Built sentiment analysis model using TF IDF and Logistic Regression
• Achieved 88 percent accuracy on review data
• Automated text preprocessing pipeline
Mini task for you
• Add bigrams
• Compare Naive Bayes
• Plot ROC curve
Double Tap ♥️ For Moreimport pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
Step 2. Load data
df = pd.read_csv("ratings.csv")
df.head()
Example data
user_id | item_id | rating
1 | 101 | 5
1 | 102 | 3
Step 3. Create user item matrix
user_item_matrix = df.pivot_table(
index='user_id',
columns='item_id',
values='rating'
)
Matrix shape
Rows users
Columns items
Values ratings
Step 4. Handle missing values
user_item_matrix.fillna(0, inplace=True)
Why? Cosine similarity needs numbers.
Step 5. Compute user similarity
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(
user_similarity,
index=user_item_matrix.index,
columns=user_item_matrix.index
)
Step 6. Find similar users
user_id = 1
similar_users = user_similarity_df[user_id].sort_values(ascending=False)
similar_users.head()
Top result User itself score 1. Ignore it.
Step 7. Recommend items
Get items rated by similar users
similar_users = similar_users[similar_users.index != user_id]
weighted_ratings = user_item_matrix.loc[similar_users.index].T.dot(similar_users)
recommendations = weighted_ratings.sort_values(ascending=False)
Remove already rated items.
already_rated = user_item_matrix.loc[user_id]
already_rated = already_rated[already_rated > 0].index
recommendations = recommendations.drop(already_rated)
recommendations.head(5)
Output Top 5 recommended item IDs.
Step 8. Why cosine similarity
• Focuses on rating pattern
• Ignores scale differences
• Fast and simple
Limitations
• Cold start problem
• Sparse matrix
• No item features
Improvements
• Item based filtering
• Matrix factorization
• Hybrid models
Resume bullet example
• Built recommendation system using collaborative filtering
• Used cosine similarity on user item matrix
• Generated personalized item recommendations
Interview explanation flow
• Difference between content based and collaborative
• Why sparsity hurts
• Cold start solutions
Mini task for you
• Convert to item based filtering
• Add minimum similarity threshold
• Evaluate using precision at K
Double Tap ♥️ For Moreimport pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error
Step 2. Load Data
df = pd.read_csv("sales.csv")
df.head()
Step 3. Date Handling
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Sort by date
df = df.sort_index()
Step 4. Visualize Sales Trend
plt.plot(df.index, df['Sales'])
plt.title("Sales over time")
plt.show()
What you observe:
- Trend
- Seasonality
- Sudden spikes
Step 5. Decompose Time Series
decomposition = seasonal_decompose(df['Sales'], model='additive')
decomposition.plot()
plt.show()
Insight
- Trend shows long-term growth
- Seasonality repeats yearly or monthly
Step 6. Train Test Split
Split by time.
train = df.iloc[:-12]
test = df.iloc[-12:]
Why Last 12 months simulate future.
Step 7. Build ARIMA Model
model = ARIMA(train['Sales'], order=(1,1,1))
model_fit = model.fit() # corrected from (link unavailable)
Order meaning
- p: autoregressive
- d: differencing
- q: moving average
Step 8. Forecast
forecast = model_fit.forecast(steps=12)
print(forecast)
Step 9. Plot Forecast vs Actual
plt.plot(train.index, train['Sales'], label='Train')
plt.plot(test.index, test['Sales'], label='Actual')
plt.plot(test.index, forecast, label='Forecast')
plt.legend()
plt.show()
Step 10. Evaluation
mae = mean_absolute_error(test['Sales'], forecast)
rmse = np.sqrt(mean_squared_error(test['Sales'], forecast))
print("MAE:", mae)
print("RMSE:", rmse)
Typical results:
- RMSE depends on scale
- Trend captured well
- Peaks harder to predict
Step 11. Business Interpretation
- Underforecast leads to stockouts
- Overforecast leads to inventory waste
- Accuracy matters near peaks
Model Improvement Ideas
- SARIMA for seasonality
- Prophet for business calendars
- Add promotions and holidays
Resume Bullet Example
- Built time series model to forecast monthly sales
- Used ARIMA with rolling time-based split
- Reduced forecasting error using trend analysis
Interview Explanation Flow
- Why random split fails
- Importance of seasonality
- Error metrics selection
Mini Task for You
- Try SARIMA
- Forecast next 24 months
- Compare RMSE across models
Double Tap ♥️ For Moreimport pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
Step 2. Load data
df = pd.read_csv("creditcard.csv")
df.head()
Step 3. Basic checks
df.shape
df['Class'].value_counts()
Output example:
• Genuine 284315
• Fraud 492
Step 4. Data understanding
Check class imbalance:
sns.countplot(x='Class', data=df)
plt.show()
Insight Highly imbalanced dataset.
Step 5. Feature scaling
Scale Amount column:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df[['Amount']])
Drop Time.python
df.drop('Time', axis=1, inplace=True)
Step 6. Split features and target
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Step 7. Baseline model
Logistic Regression with class weight:
model = LogisticRegression(
max_iter=1000, class_weight='balanced'
)
model.fit(X_train, y_train)
Why class_weight
• Penalizes fraud mistakes more
• Improves recall
Step 8. Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
Step 9. Evaluation
Confusion matrix:
confusion_matrix(y_test, y_pred)Classification report:
print(classification_report(y_test, y_pred))
ROC AUC:
roc_auc_score(y_test, y_prob)
Typical results
• Accuracy looks high but ignored
• Fraud recall improves sharply
• ROC AUC around 0.97
Step 10. Threshold tuning
Increase fraud recall:
y_pred_custom = (y_prob > 0.3).astype(int) confusion_matrix(y_test, y_pred_custom)Business logic Lower threshold catches more fraud. More false alerts accepted. Step 11. Advanced approach Random Forest:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=100, class_weight='balanced', random_state=42
)
rf.fit(X_train, y_train)
rf_prob = rf.predict_proba(X_test)[:,1]
roc_auc_score(y_test, rf_prob)
Resume bullet example
- Built fraud detection model on highly imbalanced data
- Improved fraud recall using class weighting and threshold tuning
- Evaluated model using ROC AUC instead of accuracy
Interview explanation flow
- Explain imbalance problem
- Why accuracy fails
- Why recall matters
- How threshold changes business impact
Mini task for you
- Apply SMOTE
- Compare with Isolation Forest
- Plot Precision Recall curve
Double Tap ♥️ For Moreimport pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
Step 2. Load data
df = pd.read_csv("customer_churn.csv")
df.head()
Step 3. Basic checks
df.shape
df.info()
df.isnull().sum()
Step 4. Data cleaning
Convert TotalCharges to numeric.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'].fillna(df['TotalCharges'].median(), inplace=True)
Drop customer ID.
df.drop('customerID', axis=1, inplace=True)
Step 5. Exploratory Data Analysis
Churn distribution.
sns.countplot(x='Churn', data=df)
plt.show()
Tenure vs churn.
sns.boxplot(x='Churn', y='tenure', data=df)
plt.show()
Common insights:
• Month-to-month contracts churn more
• Low tenure users churn early
• High monthly charges increase churn
Step 6. Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
df[col] = le.fit_transform(df[col])
Step 7. Feature scaling
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[num_cols] = scaler.fit_transform(df[num_cols])
Step 8. Split data
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Step 9. Build model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Step 10. Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
Step 11. Evaluation
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
roc_auc_score(y_test, y_prob)
Typical results:
• Accuracy around 78 to 83 percent
• ROC AUC around 0.84
• Recall for churn is key metric
Step 12. Business actions from model
• Target high-risk users
• Offer discounts to month-to-month users
• Push yearly contracts
• Improve onboarding for first 90 days
Resume bullet example:
• Built churn prediction model using Logistic Regression
• Identified contract type and tenure as top churn drivers
• Improved churn recall using class-aware split
Interview explanation flow:
• Revenue loss problem
• Why recall matters more than accuracy
• How features map to actions
Mini task for you:
• Train Random Forest
• Compare ROC AUC
• Tune threshold for higher recall
Double Tap ♥️ For Part-3
Вже доступно! Дослідження Telegram за 2025 — головні інсайти року 
