Data Science & Machine Learning
Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free. For collaborations: @love_data
📈 Telegram channel analysis: Data Science & Machine Learning
The channel Data Science & Machine Learning (@datasciencefun) is a prominent player in the English-language segment. The community currently has 75,660 subscribers, ranking 2,114th in the Education category and 4,359th in the India region.
📊 Audience metrics and dynamics
Since its creation (date unknown), the project has shown rapid growth, reaching 75,660 subscribers.
According to the latest data from 11 June 2026, the channel maintains steady activity. Over the last 30 days the subscriber count changed by 911, and over the last 24 hours by 29, while maintaining high reach.
- Verification status: Not verified
- Engagement rate (ER): Average audience engagement is 3.63%. In the first 24 hours after publication, content typically receives reactions from 1.36% of total subscribers.
- Post reach: Each post receives 2,747 views on average. In the first day it typically accumulates 1,032 views.
- Reactions and engagement: The audience responds actively: the average number of reactions per post is 5.
- Topical interests: The content focuses on key topics such as learning, accuracy, distribution, panda, dataset.
📝 Description and content policy
The author describes the resource as a space for expressing subjective opinions:
“Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free
For collaborations: @love_data”
Thanks to frequent updates (latest data received on 12 June 2026), the channel stays current and maintains broad reach. The analytics show that the audience actively engages with the content, making it a reference point within the Education category.
Step 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
Step 2. Load data
df = pd.read_csv("house_prices.csv")
df.head()
Step 3. Basic checks
df.shape
df.info()
df.isnull().sum()
Step 4. Data cleaning
Fill missing values.
df.fillna(df.median(numeric_only=True), inplace=True)
Step 5. Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
Step 6. Feature scaling
scaler = StandardScaler()
X = df.drop('price', axis=1)
y = df['price']
X_scaled = scaler.fit_transform(X)
Step 7. Train test split
X_train, X_test, y_train, y_test = train_test_split(
X_scaled, y, test_size=0.3, random_state=42
)
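One caveat with Steps 6 and 7 as written: the scaler is fitted on all rows before the split, so test-set statistics leak into training. A minimal sketch of one alternative, assuming synthetic data from make_regression as a stand-in for house_prices.csv (the names X_demo, y_demo and pipe are hypothetical), wraps the scaler and the model in a Pipeline so the scaler is fitted on the training rows only:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the house price data (assumption, not the real file)
X_demo, y_demo = make_regression(
    n_samples=200, n_features=5, noise=10.0, random_state=42
)

# Split first, so the test rows never influence the scaler
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=42
)

# The pipeline fits StandardScaler on X_tr only, then fits the regression
pipe = make_pipeline(StandardScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# Scoring transforms X_te with the training mean and std, then computes R2
r2_demo = pipe.score(X_te, y_te)
print("R2 on held-out rows:", r2_demo)
```

On a small tabular dataset the difference in score is often minor, but the pipeline keeps the evaluation honest and makes cross-validation safer later.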
Step 8. Build model
Linear Regression.
model = LinearRegression()
model.fit(X_train, y_train)
Step 9. Predictions
y_pred = model.predict(X_test)
Step 10. Evaluation
mae = mean_absolute_error(y_test, y_pred)
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print("MAE:", mae)
print("RMSE:", rmse)
print("R2:", r2)
Typical results
• R2 between 0.70 and 0.85
• Location and area dominate price
Step 11. Feature importance
importance = pd.DataFrame({
    'Feature': X.columns,
    'Coefficient': model.coef_
}).sort_values(by='Coefficient', ascending=False)
importance
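Interpretation: A positive coefficient increases the predicted price; a negative coefficient reduces it. Because the features were standardized in Step 6, each coefficient is the change in predicted price per one standard deviation of that feature.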
Step 12. Model improvements
• Ridge regression for multicollinearity
• Lasso for feature selection
• Random Forest for non-linear patterns
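A rough sketch of how the first two ideas could be compared, assuming synthetic data in place of the house price file. The alpha values here are placeholders, not tuned settings; in practice they would be chosen with cross-validation.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, LinearRegression, Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 10 features, only 4 of them informative (assumption)
X_syn, y_syn = make_regression(
    n_samples=300, n_features=10, n_informative=4, noise=15.0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# alpha=1.0 is an illustrative default, not a recommended value
candidates = {
    "linear": LinearRegression(),
    "ridge": Ridge(alpha=1.0),
    "lasso": Lasso(alpha=1.0),
}

rmse_by_model = {}
for name, estimator in candidates.items():
    # Scale inside the pipeline so each model sees train-only statistics
    pipe = make_pipeline(StandardScaler(), estimator)
    pipe.fit(X_tr, y_tr)
    pred = pipe.predict(X_te)
    rmse_by_model[name] = float(np.sqrt(mean_squared_error(y_te, pred)))

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.2f}")
```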
Resume bullet example
• Built house price prediction model using regression
• Achieved R2 score above 0.8
• Identified key price drivers
Interview explanation flow
• Why RMSE matters
• How multicollinearity affects coefficients
• Why tree models outperform linear sometimes
Mini task for you
• Try Ridge and Lasso
• Compare RMSE
• Plot actual vs predicted
Double Tap ♥️ For More
Step 1. Import libraries
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
nltk.download('stopwords')
Step 2. Load data
df = pd.read_csv("sentiment.csv")
df.head()
Example review: "The movie was amazing" sentiment: 1
Step 3. Basic checks
df.shape
df['sentiment'].value_counts()
Step 4. Text cleaning
stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))
def clean_text(text):
    text = text.lower()
    text = re.sub(r'[^a-z]', ' ', text)
    words = text.split()
    words = [stemmer.stem(w) for w in words if w not in stop_words]
    return ' '.join(words)
df['clean_review'] = df['review'].apply(clean_text)
Step 5. Train test split
X = df['clean_review']
y = df['sentiment']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Step 6. Text vectorization TF IDF
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
Why TF IDF
• Reduces common word weight
• Keeps meaningful words
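To see the first point in numbers, here is a small sketch on three made-up reviews (the documents are hypothetical). It reads the inverse document frequency that TfidfVectorizer learns for each term: a word present in every document gets the lowest value, a word present in only one gets the highest.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Three toy documents, invented for illustration
docs = [
    "the movie was amazing",
    "the movie was terrible",
    "the plot was boring",
]

vec = TfidfVectorizer()
vec.fit(docs)

# Map each term to its learned inverse document frequency
idf_by_term = {term: vec.idf_[idx] for term, idx in vec.vocabulary_.items()}

# "the" is in all three documents, "movie" in two, "amazing" in one
for term in ("the", "movie", "amazing"):
    print(term, round(idf_by_term[term], 3))
```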
Step 7. Model building
model = LogisticRegression(max_iter=1000)
model.fit(X_train_tfidf, y_train)
Step 8. Predictions
y_pred = model.predict(X_test_tfidf)
Step 9. Evaluation
accuracy_score(y_test, y_pred)
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
Typical results
• Accuracy 85 to 90 percent
• Precision strong on positive reviews
• Neutral text harder to classify
Step 10. Test on custom text
sample = ["The product quality is terrible"]
sample_clean = [clean_text(sample[0])]
sample_vec = tfidf.transform(sample_clean)
model.predict(sample_vec)
Output: 0 (negative)
Common interview questions
• Why TF IDF over CountVectorizer
• How stopwords affect meaning
• Why Logistic Regression works well
Improvements
• Use n grams
• Try Naive Bayes
• Use LSTM or Transformers
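A minimal sketch of the first two improvements together, assuming four made-up reviews instead of sentiment.csv. It is only meant to show the wiring: ngram_range=(1, 2) adds bigrams to the vocabulary, and MultinomialNB replaces Logistic Regression.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy reviews and labels, invented for illustration (1 = positive)
toy_reviews = [
    "the movie was good",
    "really good acting",
    "the movie was not good",
    "not good at all",
]
toy_labels = [1, 1, 0, 0]

# Unigrams plus bigrams, so "not good" becomes its own feature
nb_pipe = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), MultinomialNB())
nb_pipe.fit(toy_reviews, toy_labels)

vocab = nb_pipe.named_steps["tfidfvectorizer"].vocabulary_
has_negation_bigram = "not good" in vocab

toy_pred = nb_pipe.predict(["not good"])
print(has_negation_bigram, toy_pred)
```

Note that the clean_text function in Step 4 removes NLTK stopwords, and that list includes "not", so negation bigrams like this one only survive if the stopword list is adjusted first.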
Resume bullet example
• Built sentiment analysis model using TF IDF and Logistic Regression
• Achieved 88 percent accuracy on review data
• Automated text preprocessing pipeline
Mini task for you
• Add bigrams
• Compare Naive Bayes
• Plot ROC curve
Double Tap ♥️ For More
Step 1. Import libraries
import pandas as pd
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
Step 2. Load data
df = pd.read_csv("ratings.csv")
df.head()
Example data
user_id | item_id | rating
1 | 101 | 5
1 | 102 | 3
Step 3. Create user item matrix
user_item_matrix = df.pivot_table(
index='user_id',
columns='item_id',
values='rating'
)
Matrix shape
• Rows: users
• Columns: items
• Values: ratings
Step 4. Handle missing values
user_item_matrix.fillna(0, inplace=True)
Why? Cosine similarity cannot handle missing values. Note that 0 here means "not rated", not a low rating.
Step 5. Compute user similarity
user_similarity = cosine_similarity(user_item_matrix)
user_similarity_df = pd.DataFrame(
user_similarity,
index=user_item_matrix.index,
columns=user_item_matrix.index
)
Step 6. Find similar users
user_id = 1
similar_users = user_similarity_df[user_id].sort_values(ascending=False)
similar_users.head()
The top result is the user itself, with a score of 1. Ignore it.
Step 7. Recommend items
Get items rated by similar users
similar_users = similar_users[similar_users.index != user_id]
weighted_ratings = user_item_matrix.loc[similar_users.index].T.dot(similar_users)
recommendations = weighted_ratings.sort_values(ascending=False)
Remove already rated items.
already_rated = user_item_matrix.loc[user_id]
already_rated = already_rated[already_rated > 0].index
recommendations = recommendations.drop(already_rated)
recommendations.head(5)
Output: top 5 recommended item IDs.
Step 8. Why cosine similarity
• Focuses on rating pattern
• Ignores scale differences
• Fast and simple
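A quick illustration of the "ignores scale differences" point, using three made-up rating vectors. Two users who rate the same items in the same proportions get a similarity of 1, even if one of them gives consistently higher numbers.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical users rating the same three items
harsh_rater = np.array([[1, 2, 3]])
generous_rater = np.array([[2, 4, 6]])   # same pattern, doubled
opposite_rater = np.array([[3, 2, 1]])   # reversed preferences

same_pattern = cosine_similarity(harsh_rater, generous_rater)[0, 0]
different_pattern = cosine_similarity(harsh_rater, opposite_rater)[0, 0]

print(same_pattern, different_pattern)
```

This only holds for multiplicative scale. A constant offset, such as every rating shifted up by 2, does change the cosine, which is one reason mean-centering ratings per user is sometimes applied first.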
Limitations
• Cold start problem
• Sparse matrix
• No item features
Improvements
• Item based filtering
• Matrix factorization
• Hybrid models
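A sketch of the first improvement, item based filtering, assuming a tiny made-up ratings table in place of ratings.csv. The only change from Step 5 is the transpose: similarity is computed between item columns instead of user rows.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Toy ratings, invented for illustration
toy_ratings = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3, 3],
        "item_id": [101, 102, 101, 103, 102, 103],
        "rating": [5, 3, 4, 2, 4, 5],
    }
)
toy_matrix = toy_ratings.pivot_table(
    index="user_id", columns="item_id", values="rating"
).fillna(0)

# Transpose so each row is an item described by the ratings it received
item_similarity = pd.DataFrame(
    cosine_similarity(toy_matrix.T),
    index=toy_matrix.columns,
    columns=toy_matrix.columns,
)

# Items most similar to item 101, excluding the item itself
similar_to_101 = item_similarity[101].drop(101).sort_values(ascending=False)
print(similar_to_101)
```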
Resume bullet example
• Built recommendation system using collaborative filtering
• Used cosine similarity on user item matrix
• Generated personalized item recommendations
Interview explanation flow
• Difference between content based and collaborative
• Why sparsity hurts
• Cold start solutions
Mini task for you
• Convert to item based filtering
• Add minimum similarity threshold
• Evaluate using precision at K
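For the last task, one possible helper is sketched below. The function name, the item IDs and the "relevant" set are all hypothetical; in a real evaluation the relevant items would come from held-out ratings for that user.

```python
def precision_at_k(recommended, relevant, k):
    """Fraction of the top-k recommended items that are in the relevant set."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    top_k = list(recommended)[:k]
    hits = sum(1 for item in top_k if item in relevant)
    # Dividing by k penalizes lists shorter than k
    return hits / k


# Ranked recommendations for one user and the items they actually liked
recommended_items = [103, 105, 101, 107, 102]
relevant_items = {101, 102, 110}

p_at_3 = precision_at_k(recommended_items, relevant_items, k=3)
p_at_5 = precision_at_k(recommended_items, relevant_items, k=5)
print(p_at_3, p_at_5)
```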
Double Tap ♥️ For More
Step 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.arima.model import ARIMA
from sklearn.metrics import mean_absolute_error, mean_squared_error
Step 2. Load Data
df = pd.read_csv("sales.csv")
df.head()
Step 3. Date Handling
df['Date'] = pd.to_datetime(df['Date'])
df.set_index('Date', inplace=True)
# Sort by date
df = df.sort_index()
Step 4. Visualize Sales Trend
plt.plot(df.index, df['Sales'])
plt.title("Sales over time")
plt.show()
What you observe:
- Trend
- Seasonality
- Sudden spikes
Step 5. Decompose Time Series
decomposition = seasonal_decompose(df['Sales'], model='additive')
decomposition.plot()
plt.show()
Insight
- Trend shows long-term growth
- Seasonality repeats yearly or monthly
Step 6. Train Test Split
Split by time.
train = df.iloc[:-12]
test = df.iloc[-12:]
Why? The last 12 months simulate the future.
Step 7. Build ARIMA Model
model = ARIMA(train['Sales'], order=(1,1,1))
model_fit = model.fit()
Order meaning
- p: autoregressive
- d: differencing
- q: moving average
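To make the d term concrete, here is a small sketch on a made-up series with a pure linear trend. One round of differencing (d=1) turns the trending series into a constant one, which is the kind of stationarity the AR and MA parts expect.

```python
import numpy as np

# Hypothetical series: starts at 100 and grows by 5 each period
t = np.arange(10)
trend_series = 100 + 5 * t

# d=1 means modelling y[t] - y[t-1] instead of y[t]
first_diff = np.diff(trend_series, n=1)

print(trend_series)
print(first_diff)
```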
Step 8. Forecast
forecast = model_fit.forecast(steps=12)
print(forecast)
Step 9. Plot Forecast vs Actual
plt.plot(train.index, train['Sales'], label='Train')
plt.plot(test.index, test['Sales'], label='Actual')
plt.plot(test.index, forecast, label='Forecast')
plt.legend()
plt.show()
Step 10. Evaluation
mae = mean_absolute_error(test['Sales'], forecast)
rmse = np.sqrt(mean_squared_error(test['Sales'], forecast))
print("MAE:", mae)
print("RMSE:", rmse)
Typical results:
- RMSE depends on scale
- Trend captured well
- Peaks harder to predict
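Since RMSE depends on scale, it helps to have a baseline on the same scale to compare against. A minimal sketch, assuming a made-up monthly series with a linear trend and a fixed yearly pattern: the seasonal naive forecast predicts each test month with the value from the same month one year earlier. This baseline is an addition for illustration, not part of the original walkthrough.

```python
import numpy as np
import pandas as pd

# Toy monthly sales: trend of +1 per month plus a repeating yearly pattern
idx = pd.date_range("2020-01-01", periods=48, freq="MS")
months = np.arange(48)
seasonal_pattern = np.array([0, 2, 5, 9, 12, 14, 15, 13, 10, 6, 3, 1])
toy_sales = pd.Series(100 + months + seasonal_pattern[months % 12], index=idx)

# Same time-based split as Step 6: hold out the last 12 months
toy_train, toy_test = toy_sales.iloc[:-12], toy_sales.iloc[-12:]

# Seasonal naive: reuse the last observed year as the forecast
naive_forecast = toy_train.iloc[-12:].to_numpy()

errors = toy_test.to_numpy() - naive_forecast
naive_mae = float(np.mean(np.abs(errors)))
naive_rmse = float(np.sqrt(np.mean(errors ** 2)))
print("Seasonal naive MAE:", naive_mae, "RMSE:", naive_rmse)
```

If ARIMA cannot beat this baseline on the real data, that is a signal to try a seasonal model before anything more complex.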
Step 11. Business Interpretation
- Underforecast leads to stockouts
- Overforecast leads to inventory waste
- Accuracy matters near peaks
Model Improvement Ideas
- SARIMA for seasonality
- Prophet for business calendars
- Add promotions and holidays
Resume Bullet Example
- Built time series model to forecast monthly sales
- Used ARIMA with a time-based holdout split
- Reduced forecasting error using trend analysis
Interview Explanation Flow
- Why random split fails
- Importance of seasonality
- Error metrics selection
Mini Task for You
- Try SARIMA
- Forecast next 24 months
- Compare RMSE across models
Double Tap ♥️ For More
Step 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report, roc_auc_score
Step 2. Load data
df = pd.read_csv("creditcard.csv")
df.head()
Step 3. Basic checks
df.shape
df['Class'].value_counts()
Output example:
• Genuine 284315
• Fraud 492
Step 4. Data understanding
Check class imbalance:
sns.countplot(x='Class', data=df)
plt.show()
Insight: highly imbalanced dataset.
Step 5. Feature scaling
Scale Amount column:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
df['Amount'] = scaler.fit_transform(df[['Amount']])
Drop Time.
df.drop('Time', axis=1, inplace=True)
Step 6. Split features and target
X = df.drop('Class', axis=1)
y = df['Class']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Step 7. Baseline model
Logistic Regression with class weight:
model = LogisticRegression(
max_iter=1000, class_weight='balanced'
)
model.fit(X_train, y_train)
Why class_weight
• Penalizes fraud mistakes more
• Improves recall
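To see what 'balanced' does to the loss, the sketch below computes the weights on a made-up label vector with a 98 to 2 split. scikit-learn uses n_samples / (n_classes * class_count), so the rare class gets a much larger weight.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Toy labels: 98 genuine (0) and 2 fraud (1), invented for illustration
toy_y = np.array([0] * 98 + [1] * 2)

weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=toy_y
)
weight_by_class = dict(zip([0, 1], weights))

# Each fraud mistake now costs roughly 49 times a genuine mistake
print(weight_by_class)
```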
Step 8. Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
Step 9. Evaluation
Confusion matrix:
confusion_matrix(y_test, y_pred)
Classification report:
print(classification_report(y_test, y_pred))
ROC AUC:
roc_auc_score(y_test, y_prob)
Typical results
• Accuracy looks high but should be ignored
• Fraud recall improves sharply
• ROC AUC around 0.97
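Why accuracy is ignored here: a sketch using the class counts from Step 3 (284315 genuine, 492 fraud). A "model" that labels every transaction as genuine scores very high accuracy while catching no fraud at all.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Labels built from the counts shown in Step 3
y_counts = np.array([0] * 284315 + [1] * 492)

# Trivial baseline: predict "genuine" for every row
always_genuine = np.zeros_like(y_counts)

baseline_accuracy = accuracy_score(y_counts, always_genuine)
baseline_recall = recall_score(y_counts, always_genuine)

print("Accuracy:", round(baseline_accuracy, 4))
print("Fraud recall:", baseline_recall)
```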
Step 10. Threshold tuning
Increase fraud recall:
y_pred_custom = (y_prob > 0.3).astype(int)
confusion_matrix(y_test, y_pred_custom)
Business logic
Lower threshold catches more fraud. More false alerts accepted.
Step 11. Advanced approach
Random Forest:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(
n_estimators=100, class_weight='balanced', random_state=42
)
rf.fit(X_train, y_train)
rf_prob = rf.predict_proba(X_test)[:,1]
roc_auc_score(y_test, rf_prob)
Resume bullet example
- Built fraud detection model on highly imbalanced data
- Improved fraud recall using class weighting and threshold tuning
- Evaluated model using ROC AUC instead of accuracy
Interview explanation flow
- Explain imbalance problem
- Why accuracy fails
- Why recall matters
- How threshold changes business impact
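For the last point, a tiny worked example can help in an interview. The labels and probabilities below are made up; the sketch compares the default 0.5 cut-off with the 0.3 cut-off used in Step 10 and counts how many alerts each one raises.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Toy ground truth and predicted fraud probabilities (hypothetical values)
toy_true = np.array([0, 0, 0, 0, 1, 1])
toy_prob = np.array([0.10, 0.20, 0.35, 0.60, 0.40, 0.90])

metrics_by_threshold = {}
for threshold in (0.5, 0.3):
    toy_pred = (toy_prob > threshold).astype(int)
    metrics_by_threshold[threshold] = {
        "recall": recall_score(toy_true, toy_pred),
        "precision": precision_score(toy_true, toy_pred),
        "alerts": int(toy_pred.sum()),  # transactions sent for review
    }

for threshold, metrics in metrics_by_threshold.items():
    print(threshold, metrics)
```

In this toy case the lower threshold catches both frauds but doubles the number of alerts, which is the trade-off a fraud team has to staff for.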
Mini task for you
- Apply SMOTE
- Compare with Isolation Forest
- Plot Precision Recall curve
Double Tap ♥️ For More
Step 1. Import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score
Step 2. Load data
df = pd.read_csv("customer_churn.csv")
df.head()
Step 3. Basic checks
df.shape
df.info()
df.isnull().sum()
Step 4. Data cleaning
Convert TotalCharges to numeric.
df['TotalCharges'] = pd.to_numeric(df['TotalCharges'], errors='coerce')
df['TotalCharges'] = df['TotalCharges'].fillna(df['TotalCharges'].median())
Drop customer ID.
df.drop('customerID', axis=1, inplace=True)
Step 5. Exploratory Data Analysis
Churn distribution.
sns.countplot(x='Churn', data=df)
plt.show()
Tenure vs churn.
sns.boxplot(x='Churn', y='tenure', data=df)
plt.show()
Common insights:
• Month-to-month contracts churn more
• Low tenure users churn early
• High monthly charges increase churn
Step 6. Encode categorical variables
le = LabelEncoder()
for col in df.select_dtypes(include='object').columns:
    df[col] = le.fit_transform(df[col])
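One thing to keep in mind: LabelEncoder turns categories into integers, and a linear model will read those integers as an order. For features with more than two unordered categories, one-hot encoding is a common alternative. A minimal sketch, assuming a toy frame with a hypothetical Contract column:

```python
import pandas as pd

# Toy customer data, invented for illustration
toy_customers = pd.DataFrame(
    {
        "Contract": ["Month-to-month", "One year", "Two year", "Month-to-month"],
        "tenure": [1, 24, 60, 3],
    }
)

# One column per category; drop_first avoids a redundant column
encoded = pd.get_dummies(
    toy_customers, columns=["Contract"], drop_first=True, dtype=int
)
print(encoded)
```

Month-to-month becomes the reference level here, so the coefficients on the remaining columns are read relative to it.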
Step 7. Feature scaling
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
df[num_cols] = scaler.fit_transform(df[num_cols])
Step 8. Split data
X = df.drop('Churn', axis=1)
y = df['Churn']
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.3, random_state=42, stratify=y
)
Step 9. Build model
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
Step 10. Predictions
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:,1]
Step 11. Evaluation
confusion_matrix(y_test, y_pred)
print(classification_report(y_test, y_pred))
roc_auc_score(y_test, y_prob)
Typical results:
• Accuracy around 78 to 83 percent
• ROC AUC around 0.84
• Recall for churn is key metric
Step 12. Business actions from model
• Target high-risk users
• Offer discounts to month-to-month users
• Push yearly contracts
• Improve onboarding for first 90 days
Resume bullet example:
• Built churn prediction model using Logistic Regression
• Identified contract type and tenure as top churn drivers
• Used a stratified split to keep the churn rate consistent across train and test sets
Interview explanation flow:
• Revenue loss problem
• Why recall matters more than accuracy
• How features map to actions
Mini task for you:
• Train Random Forest
• Compare ROC AUC
• Tune threshold for higher recall
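A starting point for the first two tasks, assuming synthetic data from make_classification as a stand-in for customer_churn.csv. The class weights and hyperparameters are placeholders, so the numbers it prints say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in with roughly 25% positives (assumption)
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=10, n_informative=5,
    weights=[0.75, 0.25], random_state=42,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42, stratify=y_syn
)

models = {
    "logistic": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=42),
}

auc_by_model = {}
for name, clf in models.items():
    clf.fit(X_tr, y_tr)
    # ROC AUC needs the probability of the positive class, not hard labels
    auc_by_model[name] = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])

for name, auc in auc_by_model.items():
    print(f"{name}: ROC AUC = {auc:.3f}")
```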
Double Tap ♥️ For Part-3
