Data Science & Machine Learning
Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free. For collaborations: @love_data
Analytical overview of Telegram channel Data Science & Machine Learning
The channel Data Science & Machine Learning (@datasciencefun) is active in the English-language segment. It currently has 75 818 subscribers and ranks 2 113 in the Education category and 4 286 in the India region.
Audience metrics and dynamics
Since its creation (date unknown), the project has grown rapidly, gathering an audience of 75 818 subscribers.
According to the latest data from 18 June, 2026, the channel demonstrates stable activity. Although there has been a change in the number of participants by 884 over the last 30 days and by 6 over the last 24 hours, overall reach remains high.
- Verification status: Not verified
- Engagement rate (ER): The average audience engagement rate is 3.25%. Within the first 24 hours after publication, content typically collects 1.38% reactions from the total number of subscribers.
- Post reach: On average, each post receives 2 462 views. Within the first day, a publication typically gains 1 043 views.
- Reactions and interaction: The audience actively supports content: the average number of reactions per post is 4.
- Thematic interests: Content is focused on key topics such as learning, accuracy, distribution, panda, dataset.
Description and content policy
The author describes the resource as a platform for expressing subjective opinions:
"Join this channel to learn data science, artificial intelligence and machine learning with funny quizzes, interesting projects and amazing resources for free
For collaborations: @love_data"
Thanks to the high frequency of updates (latest data received on 19 June, 2026), the channel maintains relevance and a high level of publication reach. Analytics show that the audience actively interacts with content, making it an important point of influence in the Education category.
MinPts neighbors within a radius of ε.
- Border Point: A point that is not a core point but is within the neighborhood of a core point.
- Noise Point: A point that is neither a core point nor a border point (outlier).
#### Algorithm Steps
1. Identify Core Points: For each point in the dataset, find its ε-neighborhood. If it contains at least MinPts points, mark it as a core point.
2. Expand Clusters: From each core point, recursively collect directly density-reachable points to form a cluster.
3. Label Border and Noise Points: Points that are reachable from core points but not core points themselves are labeled as border points. Points that are not reachable from any core point are labeled as noise.
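The point types and the first labeling step above can be sketched with plain NumPy. This is a minimal illustration, not a full DBSCAN: it only classifies points and does not expand clusters. The function name `classify_points` and the toy data are hypothetical, and the sketch assumes that a point counts as a member of its own ε-neighborhood (the convention scikit-learn uses for `min_samples`).

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Label each point as core, border, or noise (DBSCAN point types)."""
    # Pairwise Euclidean distances; O(n^2) memory, fine for small datasets.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps                  # eps-neighborhood, includes the point itself (assumption)
    core = neighbors.sum(axis=1) >= min_pts   # step 1: core points
    # A border point is not core but lies within eps of at least one core point.
    near_core = (neighbors & core[None, :]).any(axis=1)
    border = ~core & near_core
    noise = ~core & ~near_core
    return core, border, noise

rng = np.random.default_rng(0)
# Hypothetical toy data: one dense blob plus a few scattered points.
X = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.uniform(-4, 4, (5, 2))])
core, border, noise = classify_points(X, eps=0.5, min_pts=5)
print(f"core={core.sum()}, border={border.sum()}, noise={noise.sum()}")
```

Every point falls into exactly one of the three groups, which is what the cluster-expansion step then builds on.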
#### Implementation
Let's consider an example using Python and its libraries.
##### Example
Suppose we have a dataset with points in a 2D space, and we want to cluster them using DBSCAN.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
import seaborn as sns
# Generate example data (make_moons dataset)
X, y = make_moons(n_samples=300, noise=0.1, random_state=0)
# Applying DBSCAN
epsilon = 0.2
min_samples = 5
db = DBSCAN(eps=epsilon, min_samples=min_samples)
clusters = db.fit_predict(X)
# Adding cluster labels to the dataframe
df = pd.DataFrame(X, columns=['Feature 1', 'Feature 2'])
df['Cluster'] = clusters
# Plotting the clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x='Feature 1', y='Feature 2', hue='Cluster', palette='Set1', data=df)
plt.title('DBSCAN Clustering')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.show()
#### Explanation of the Code
1. Libraries: We import necessary libraries like numpy, pandas, sklearn, matplotlib, and seaborn.
2. Data Preparation: We generate a synthetic dataset using make_moons with two features.
3. Applying DBSCAN: We apply the DBSCAN algorithm with specified epsilon and min_samples values to cluster the data.
4. Adding Cluster Labels: We create a DataFrame with the features and cluster labels.
5. Plotting: We scatter plot the data points with colors indicating different clusters.
#### Choosing Parameters
Choosing appropriate values for ε and MinPts is crucial:
- Epsilon (ε): Often determined using a k-distance graph where k = MinPts - 1. A sudden change in the slope can suggest a good value for ε.
- MinPts: Typically set to at least the dimensionality of the dataset plus one. For 2D data, a common value is 4 or 5.
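The k-distance idea can be sketched as follows, assuming the same `make_moons` data as the example and a hypothetical `min_pts = 5`. The sketch only computes the sorted k-distances; plotting them against their index (for example with matplotlib) and reading off the knee is left to the reader.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.neighbors import NearestNeighbors

X, _ = make_moons(n_samples=300, noise=0.1, random_state=0)
min_pts = 5          # assumed value, matching the example above
k = min_pts - 1

# Querying the training data itself returns each point as its own nearest
# neighbor (distance 0), so ask for k + 1 neighbors and keep the last column.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
distances, _ = nn.kneighbors(X)
k_dist = np.sort(distances[:, -1])

# A sharp rise near the upper end of k_dist hints at a candidate value for eps.
print(np.percentile(k_dist, [50, 90, 95, 99]))
```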
#### Handling Outliers
DBSCAN can identify outliers as noise points. These are points that do not belong to any cluster, making DBSCAN robust to noise in the data.
#### Applications
DBSCAN is widely used in:
- Geospatial Data Analysis: Identifying regions of interest in spatial data.
- Image Segmentation: Grouping pixels into regions based on their intensity.
- Anomaly Detection: Identifying unusual patterns or outliers in datasets.
DBSCAN is powerful for discovering clusters of arbitrary shape and handling noise effectively. However, it can struggle with varying densities and requires careful tuning of parameters.
Cracking the Data Science Interview
https://topmate.io/analyst/1024129
Credits: t.me/datasciencefun
ENJOY LEARNING
# Import necessary libraries
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
# Example data: list of transactions
data = {'TransactionID': [1, 1, 1, 2, 2, 3, 3, 3, 4, 4, 4, 4],
'Item': ['Milk', 'Bread', 'Butter', 'Bread', 'Butter', 'Milk', 'Bread', 'Eggs', 'Milk', 'Bread', 'Butter', 'Eggs']}
df = pd.DataFrame(data)
df = df.groupby(['TransactionID', 'Item'])['Item'].count().unstack().reset_index().fillna(0).set_index('TransactionID')
df = df > 0  # boolean one-hot matrix (applymap is deprecated in recent pandas; apriori accepts a boolean DataFrame)
# Applying the Apriori algorithm
frequent_itemsets = apriori(df, min_support=0.5, use_colnames=True)
# Generating association rules
rules = association_rules(frequent_itemsets, metric='confidence', min_threshold=0.7)
print("Frequent Itemsets:")
print(frequent_itemsets)
print("\nAssociation Rules:")
print(rules)
#### Explanation of the Code
1. Libraries: We import necessary libraries like pandas and mlxtend.
2. Data Preparation: We create a transaction dataset and transform it into a format suitable for the Apriori algorithm, where each row represents a transaction and each column represents an item.
3. Apriori Algorithm: We apply the Apriori algorithm to find frequent itemsets with a minimum support of 0.5.
4. Association Rules: We generate association rules from the frequent itemsets with a minimum confidence of 0.7.
#### Evaluation Metrics
- Support: Measures the frequency of an itemset in the dataset.
- Confidence: Measures the reliability of the inference made by the rule.
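- Lift: Measures how much more often the antecedent and consequent occur together than expected if they were independent. A lift above 1 indicates a positive association, a lift of 1 indicates independence, and a lift below 1 indicates a negative association.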
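To make the three metrics concrete, here is a small hand computation on the same four transactions used in the example, written in plain Python. The helper names `support` and `rule_metrics` are hypothetical and are not part of mlxtend.

```python
# The four transactions from the example data above.
transactions = [
    {"Milk", "Bread", "Butter"},
    {"Bread", "Butter"},
    {"Milk", "Bread", "Eggs"},
    {"Milk", "Bread", "Butter", "Eggs"},
]

def support(itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def rule_metrics(antecedent, consequent):
    """Support, confidence, and lift of the rule antecedent -> consequent."""
    s_both = support(set(antecedent) | set(consequent))
    confidence = s_both / support(antecedent)
    lift = confidence / support(consequent)
    return s_both, confidence, lift

sup, conf, lift = rule_metrics({"Eggs"}, {"Milk"})
print(f"support={sup:.2f}, confidence={conf:.2f}, lift={lift:.2f}")
# → support=0.50, confidence=1.00, lift=1.33
```

Every transaction containing Eggs also contains Milk, so the confidence is 1.0, and because Milk appears in only 3 of 4 transactions the lift is 1 / 0.75.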
#### Applications
Association rule learning is widely used in:
- Market Basket Analysis: Identifying products frequently bought together to optimize store layouts and cross-selling strategies.
- Recommendation Systems: Recommending products or services based on customer purchase history.
- Healthcare: Discovering associations between medical conditions and treatments.
Best Data Science & Machine Learning Resources: https://topmate.io/coding/914624
# Import necessary libraries
import numpy as np
import pandas as pd
from scipy.cluster.hierarchy import dendrogram, linkage, fcluster
import matplotlib.pyplot as plt
import seaborn as sns
# Example data
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
np.random.normal(5, 1, (100, 2)),
np.random.normal(-5, 1, (100, 2))))
# Performing hierarchical clustering
Z = linkage(X, method='ward')
# Plotting the dendrogram
plt.figure(figsize=(10, 7))
dendrogram(Z, truncate_mode='level', p=5, leaf_rotation=90., leaf_font_size=12., show_contracted=True)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Sample index')
plt.ylabel('Distance')
plt.show()
# Cutting the dendrogram to form clusters
max_d = 7.0  # Example threshold; read a suitable cut height off the dendrogram (Ward distances grow with cluster size)
clusters = fcluster(Z, max_d, criterion='distance')
# Plotting the clusters
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=clusters, palette='viridis', s=50, edgecolor='k')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('Hierarchical Clustering')
plt.show()
#### Explanation of the Code
1. Libraries: We import numpy, pandas, scipy, matplotlib, and seaborn.
2. Data Preparation: We generate a synthetic dataset with three clusters using normal distributions.
3. Linkage: We use the linkage function from scipy.cluster.hierarchy to perform hierarchical clustering with Ward's method.
4. Dendrogram: We plot the dendrogram using the dendrogram function to visualize the hierarchical structure.
5. Cutting the Dendrogram: We cut the dendrogram at a specific threshold to form clusters using the fcluster function.
6. Plotting Clusters: We scatter plot the data points with colors indicating the assigned clusters.
#### Choosing the Number of Clusters
The dendrogram helps visualize the hierarchy of clusters. The choice of where to cut the dendrogram (i.e., selecting a threshold distance) determines the number of clusters. This choice can be subjective, but some guidelines include:
- Elbow Method: Similar to k-Means, look for an "elbow" in the dendrogram where the distance between merges increases significantly.
- Maximum Distance: Choose a distance threshold that balances the number of clusters and the compactness of clusters.
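One way to turn the "largest jump between merges" guideline into code is sketched below. The data is a hypothetical set of three well-separated blobs generated with `make_blobs`, and limiting the search to the last 10 merges is an arbitrary choice for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.datasets import make_blobs

# Hypothetical data: three well-separated blobs of 100 points each.
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 0], [0, 10]],
                  cluster_std=1.0, random_state=0)
Z = linkage(X, method='ward')

# Column 2 of Z holds the merge distances, in increasing order for Ward linkage.
last = Z[-10:, 2]
gaps = np.diff(last)
i = int(np.argmax(gaps))        # largest jump lies between merge i and merge i + 1
k = len(last) - i               # clusters remaining if we stop before merge i + 1

# Cut halfway inside the largest gap.
threshold = (last[i] + last[i + 1]) / 2
labels = fcluster(Z, t=threshold, criterion='distance')
print(f"suggested clusters: {k}, cut height: {threshold:.1f}")
```

The heuristic is only a starting point: on data with less clear separation the largest gap may not correspond to the most useful number of clusters.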
#### Applications
Hierarchical clustering is widely used in:
- Gene Expression Data: Grouping similar genes or samples in bioinformatics.
- Document Clustering: Organizing documents into a hierarchical structure.
- Image Segmentation: Dividing an image into regions based on pixel similarity.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
import matplotlib.pyplot as plt
import seaborn as sns
# Example data
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
np.random.normal(5, 1, (100, 2)),
np.random.normal(-5, 1, (100, 2))))
# Applying k-Means clustering
k = 3
kmeans = KMeans(n_clusters=k, random_state=0)
y_kmeans = kmeans.fit_predict(X)
# Plotting the clusters
plt.figure(figsize=(8,6))
sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y_kmeans, palette='viridis', s=50, edgecolor='k')
plt.scatter(kmeans.cluster_centers_[:, 0], kmeans.cluster_centers_[:, 1], s=300, c='red', label='Centroids')
plt.xlabel('Feature 1')
plt.ylabel('Feature 2')
plt.title('k-Means Clustering')
plt.legend()
plt.show()
#### Explanation of the Code
1. Libraries: We import necessary libraries like numpy, pandas, sklearn, matplotlib, and seaborn.
2. Data Preparation: We generate a synthetic dataset with three clusters using normal distributions.
3. k-Means Clustering: We create a KMeans object with \( k=3 \) clusters and fit it to the data. The fit_predict method assigns each data point to a cluster.
4. Plotting: We scatter plot the data points with colors indicating the assigned clusters and plot the centroids in red.
#### Choosing the Number of Clusters
Selecting the appropriate number of clusters (\( k \)) is crucial. Common methods to determine \( k \) include:
- Elbow Method: Plot the within-cluster sum of squares (WCSS) against the number of clusters and look for an "elbow" point where the rate of decrease sharply slows.
- Silhouette Score: Measures how similar an object is to its own cluster compared to other clusters. Higher silhouette scores indicate better-defined clusters.
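The silhouette approach can be sketched as follows, reusing the three-blob data from the example. The range of candidate values (2 to 6) and `n_init=10` are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Same synthetic data as the k-Means example above.
np.random.seed(0)
X = np.vstack((np.random.normal(0, 1, (100, 2)),
               np.random.normal(5, 1, (100, 2)),
               np.random.normal(-5, 1, (100, 2))))

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f"k={k}: silhouette={s:.3f}")
print(f"best k by silhouette: {best_k}")
```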
#### Elbow Method Example
# Elbow Method to find the optimal number of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, random_state=0)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)  # inertia_ is the within-cluster sum of squares
plt.figure(figsize=(8,6))
plt.plot(range(1, 11), wcss, marker='o')
plt.xlabel('Number of clusters')
plt.ylabel('WCSS')
plt.title('Elbow Method')
plt.show()
#### Evaluation Metrics
- Within-Cluster Sum of Squares (WCSS): Measures the compactness of the clusters. Lower WCSS indicates more compact clusters.
- Silhouette Score: Measures the separation between clusters. Values range from -1 to 1, with higher values indicating better-defined clusters.
#### Applications
k-Means clustering is widely used in:
- Market Segmentation: Grouping customers based on purchasing behavior.
- Image Compression: Reducing the number of colors in an image.
- Anomaly Detection: Identifying outliers in a dataset.
k-Means is efficient and easy to implement but can be sensitive to the initial placement of centroids and the choice of \( k \). It works well for spherical clusters but may struggle with non-spherical or overlapping clusters.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
import matplotlib.pyplot as plt
# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y = iris.target
# Standardizing the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
# Applying PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
# Plotting the principal components
plt.figure(figsize=(8,6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, cmap='viridis', edgecolor='k', s=50)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA of Iris Dataset')
plt.colorbar()
plt.show()
# Explained variance
explained_variance = pca.explained_variance_ratio_
print(f"Explained Variance by Component 1: {explained_variance[0]:.2f}")
print(f"Explained Variance by Component 2: {explained_variance[1]:.2f}")
#### Explanation of the Code
1. Libraries: We import necessary libraries like numpy, pandas, sklearn, and matplotlib.
2. Data Preparation: We use the Iris dataset with four features.
3. Standardization: We standardize the features to have zero mean and unit variance.
4. Applying PCA: We create a PCA object with 2 components and fit it to the standardized data, then transform the data to the new 2-dimensional subspace.
5. Plotting: We scatter plot the principal components with color indicating different classes.
6. Explained Variance: We print the proportion of variance explained by the first two principal components.
#### Explained Variance
- Explained Variance: Indicates how much of the total variance in the data is captured by each principal component. For the standardized Iris data, the first principal component explains about 73% of the variance and the second about 23%, so together they capture roughly 96%.
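A common follow-up is to use the cumulative explained variance to decide how many components to keep. The sketch below does this for the standardized Iris data; the 95% target is a hypothetical threshold, not a rule.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Fit PCA with all components to see the full variance spectrum.
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)

target = 0.95  # hypothetical threshold for "enough" variance
# Index of the first component at which the cumulative ratio reaches the target.
n_components = int(np.searchsorted(cumulative, target) + 1)

print("cumulative explained variance:", np.round(cumulative, 3))
print(f"components needed for {target:.0%}: {n_components}")
```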
#### Applications
PCA is widely used in:
- Data Visualization: Reducing high-dimensional data to 2 or 3 dimensions for visualization.
- Noise Reduction: Removing noise by retaining only the principal components with significant variance.
- Feature Extraction: Deriving new features that capture the essential information.
PCA is a powerful tool for simplifying complex datasets while retaining the most important information. However, it assumes linear relationships among variables and may not capture complex patterns in the data.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
# Example data
data = {
'Feature1': [1, 2, 3, 4, 5, 1, 2, 3, 4, 5],
'Feature2': [5, 4, 3, 2, 1, 5, 4, 3, 2, 1],
'Feature3': [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],
'Spam': [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
}
df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Feature1', 'Feature2', 'Feature3']]
y = df['Spam']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the Multinomial Naive Bayes model
model = MultinomialNB()
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
#### Explanation of the Code
1. Libraries: We import necessary libraries like numpy, pandas, and sklearn.
2. Data Preparation: We create a DataFrame containing features (Feature1, Feature2, Feature3) and the target variable (Spam).
3. Feature and Target: We separate the features and the target variable.
4. Train-Test Split: We split the data into training and testing sets.
5. Model Training: We create a MultinomialNB model and train it using the training data.
6. Predictions: We use the trained model to predict whether the emails in the test set are spam.
7. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
#### Evaluation Metrics
- Accuracy: The proportion of correctly classified instances among the total instances.
- Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
- Classification Report: Provides precision, recall, F1-score, and support for each class.
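The relationship between the confusion matrix and the other metrics can be checked by hand. The labels below are hypothetical and chosen only so the arithmetic is easy to follow.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels (1 = spam, 0 = not spam).
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

# For binary labels, ravel() returns the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print(f"TP={tp}, TN={tn}, FP={fp}, FN={fn}")
print(f"accuracy={accuracy:.2f}, precision={precision:.2f}, recall={recall:.2f}, f1={f1:.2f}")
# → accuracy=0.80, precision=0.75, recall=0.75, f1=0.75
```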
#### Applications
Naive Bayes classifiers are widely used for:
- Text Classification: Spam detection, sentiment analysis, and document categorization.
- Medical Diagnosis: Predicting diseases based on symptoms.
- Recommendation Systems: Recommending products or services based on user behavior.
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, :2] # Using sepal length and sepal width as features
y = iris.target
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the KNN model with k=5
model = KNeighborsClassifier(n_neighbors=5)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Plotting the decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    # Build a grid covering the feature space and predict a class for each grid point
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Sepal Length')
    plt.ylabel('Sepal Width')
    plt.title('KNN Decision Boundary')
    plt.show()
plot_decision_boundary(X_test, y_test, model)
#### Explanation of the Code
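1. Libraries: We import numpy, pandas, sklearn, matplotlib, and seaborn.
2. Data Preparation: We load the Iris dataset and keep sepal length and sepal width as features.
3. Train-Test Split: We split the data into training and testing sets.
4. Model Training: We create a KNeighborsClassifier with k=5 and train it using the training data.
5. Predictions: We use the trained model to predict the species of iris flowers for the test set.
6. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
7. Visualization: We plot the decision boundary to visualize how the KNN classifier separates the classes.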
#### Evaluation Metrics
- Confusion Matrix: Shows the counts of true positives, true negatives, false positives, and false negatives.
- Classification Report: Provides precision, recall, F1-score, and support for each class.
#### Decision Boundary
The decision boundary plot helps to visualize how the KNN classifier separates the different classes in the feature space. KNN decision boundaries can be quite complex, reflecting the non-linear separability of the data.
KNN is intuitive and simple but can be computationally expensive, especially with large datasets, since it requires storing and searching through all training instances during prediction. The choice of \( k \) and the distance metric are critical to the model's performance.
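Since the choice of \( k \) matters, one simple way to compare candidates is cross-validation. The sketch below uses the same two sepal features as the example; the list of candidate values and the 5-fold setting are assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target  # sepal length and sepal width

cv_scores = {}
for k in (1, 3, 5, 7, 9, 11, 15):  # hypothetical candidate values
    model = KNeighborsClassifier(n_neighbors=k)
    # Mean accuracy over 5 cross-validation folds.
    cv_scores[k] = cross_val_score(model, X, y, cv=5).mean()

best_k = max(cv_scores, key=cv_scores.get)
for k, score in cv_scores.items():
    print(f"k={k}: mean CV accuracy={score:.3f}")
print(f"best k: {best_k}")
```

Very small values of \( k \) tend to follow noise in the training data, while very large values smooth over real structure, so the best value usually lies somewhere in between.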
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data (Iris dataset)
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data[:, 2:4] # Using petal length and petal width as features
y = iris.target
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the SVM model with RBF kernel
model = SVC(kernel='rbf', C=1.0, gamma='scale', random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Plotting the decision boundary
def plot_decision_boundary(X, y, model):
    h = .02  # step size in the mesh
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    # Build a grid covering the feature space and predict a class for each grid point
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, alpha=0.8)
    sns.scatterplot(x=X[:, 0], y=X[:, 1], hue=y, palette='bright', edgecolor='k', s=50)
    plt.xlabel('Petal Length')
    plt.ylabel('Petal Width')
    plt.title('SVM Decision Boundary')
    plt.show()
plot_decision_boundary(X_test, y_test, model)
#### Explanation of the Code
1. Libraries: We import numpy, pandas, sklearn, matplotlib, and seaborn.
2. Data Preparation: We load the Iris dataset and keep petal length and petal width as features.
3. Train-Test Split: We split the data into training and testing sets.
4. Model Training: We create an SVC model with an RBF kernel (kernel='rbf'), regularization parameter C=1.0, and gamma parameter set to 'scale', and train it using the training data.
5. Predictions: We use the trained model to predict the species of iris flowers for the test set.
6. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
7. Visualization: We plot the decision boundary to visualize how the SVM separates the classes.
#### Decision Boundary
The decision boundary plot helps to visualize how the SVM model separates the different classes in the feature space. The SVM with an RBF kernel can capture more complex relationships than a linear classifier.
SVMs are powerful for high-dimensional spaces and effective when the number of dimensions is greater than the number of samples. However, they can be memory-intensive and require careful tuning of hyperparameters such as the regularization parameter \(C\) and kernel parameters.
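Tuning \(C\) and the kernel parameters is often done with a grid search over cross-validated scores. The following is a minimal sketch on the same petal features; the parameter grid is a hypothetical starting point, not a recommended setting.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

iris = load_iris()
X, y = iris.data[:, 2:4], iris.target  # petal length and petal width
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Hypothetical grid of candidate values for C and gamma.
param_grid = {"C": [0.1, 1, 10, 100], "gamma": ["scale", 0.01, 0.1, 1]}

# Each combination is scored with 5-fold cross-validation on the training set only.
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X_train, y_train)

print("best parameters:", search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
print(f"test accuracy: {search.score(X_test, y_test):.3f}")
```

Keeping the test set out of the search means the final test accuracy remains an honest estimate of how the tuned model generalizes.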
# Import necessary libraries
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
import matplotlib.pyplot as plt
import seaborn as sns
# Example data
data = {
'Age': [25, 45, 35, 50, 23, 37, 32, 28, 40, 27],
'Income': [50000, 60000, 70000, 80000, 20000, 30000, 40000, 55000, 65000, 75000],
'Years_Experience': [1, 20, 10, 25, 2, 5, 7, 3, 15, 12],
'Loan_Approved': [0, 1, 1, 1, 0, 0, 1, 0, 1, 1]
}
df = pd.DataFrame(data)
# Independent variables (features) and dependent variable (target)
X = df[['Age', 'Income', 'Years_Experience']]
y = df['Loan_Approved']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
# Creating and training the gradient boosting model
model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=0)
model.fit(X_train, y_train)
# Making predictions
y_pred = model.predict(X_test)
# Evaluating the model
accuracy = accuracy_score(y_test, y_pred)
conf_matrix = confusion_matrix(y_test, y_pred)
class_report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print(f"Confusion Matrix:\n{conf_matrix}")
print(f"Classification Report:\n{class_report}")
# Feature importance
feature_importances = pd.DataFrame(model.feature_importances_, index=X.columns, columns=['Importance']).sort_values('Importance', ascending=False)
print(f"Feature Importances:\n{feature_importances}")
# Plotting the feature importances
sns.barplot(x=feature_importances.index, y=feature_importances['Importance'])
plt.title('Feature Importances')
plt.xlabel('Feature')
plt.ylabel('Importance')
plt.show()
#### Explanation of the Code
1. Libraries: We import necessary libraries like numpy, pandas, sklearn, matplotlib, and seaborn.
2. Data Preparation: We create a DataFrame containing features (Age, Income, Years_Experience) and the target variable (Loan_Approved).
3. Feature and Target: We separate the features and the target variable.
4. Train-Test Split: We split the data into training and testing sets.
5. Model Training: We create a GradientBoostingClassifier model with 100 estimators (n_estimators=100), a learning rate of 0.1, and a maximum depth of 3, and train it using the training data.
6. Predictions: We use the trained model to predict loan approval for the test set.
7. Evaluation: We evaluate the model using accuracy, confusion matrix, and classification report.
8. Feature Importance: We compute and display the importance of each feature.
9. Visualization: We plot the feature importances to visualize which features contribute most to the model's predictions.
#### Evaluation Metrics
- Accuracy: The proportion of correctly classified instances among the total instances.
- Confusion Matrix: Counts of TP, TN, FP, and FN.
- Classification Report: Provides precision, recall, F1-score, and support for each class.
