Data Science & Machine Learning


The first channel on Telegram that offers exciting questions, answers, and tests in data science, artificial intelligence, machine learning, and programming languages. For promotions: @love_data


📈 Analytical overview of Telegram channel Data Science & Machine Learning

Channel Data Science & Machine Learning (@datascienceinterviews) in the English language segment is an active participant. Currently, the community unites 27 265 subscribers, ranking 7 190 in the Education category and 15 948 in the India region.

📊 Audience metrics and dynamics

Since its creation (date unknown), the project has demonstrated rapid growth, gathering an audience of 27 265 subscribers.

According to the latest data from 14 June, 2026, the channel demonstrates stable activity. Although there has been a change in the number of participants by 142 over the last 30 days and by 10 over the last 24 hours, overall reach remains high.

Verification status: Not verified
Engagement rate (ER): The average audience engagement rate is 0.56%. Within the first 24 hours after publication, content typically collects 0.53% reactions from the total number of subscribers.
Post reach: On average, each post receives 152 views. Within the first day, a publication typically gains 144 views.
Reactions and interaction: The audience actively supports content: the average number of reactions per post is 1.
Thematic interests: Content is focused on key topics such as insidead, mining, pinix, learning, neo.

📝 Description and content policy

The author describes the resource as a platform for expressing subjective opinions:
“The first channel on Telegram that offers exciting questions, answers, and tests in data science, artificial intelligence, machine learning, and programming languages. For promotions: @love_data”

Thanks to the high frequency of updates (latest data received on 15 June, 2026), the channel maintains relevance and a high level of publication reach. Analytics show that the audience actively interacts with content, making it an important point of influence in the Education category.

Subscribers: 27 265 (+10 in 24 hours, +40 in 7 days, +142 in 30 days)
Post views: 152 (~144 in 24 hours, no data for 48 hours)
Engagement rate: 0.56%
Posts per day: ~2

Posts Archive



Coding and Aptitude Round before interview
Coding challenges are meant to test your coding skills (especially if you are applying for an ML engineer role). They can contain algorithm and data structure problems of varying difficulty, and they are timed according to how complicated the questions are. They are intended to test your basic algorithmic thinking. Sometimes a more involved data science question, such as making predictions based on Twitter data, is also given. These challenges are hosted on HackerRank, HackerEarth, CoderByte, etc. In addition, you may be asked multiple-choice questions on the fundamentals of data science and statistics.
This round is meant to be a filtering round in which candidates whose fundamentals are a little shaky are eliminated. These rounds are typically conducted without any manual intervention, so it is important to be well prepared.
Sometimes an aptitude test is conducted, either separately or along with the technical round. A Data Scientist is expected to have good aptitude, as the field is continuously evolving and a Data Scientist encounters new challenges every day. If you have taken the GMAT, GRE, or CAT, this should be easy for you.
Resources for Prep:
• For algorithms and data structures prep, Leetcode and Hackerrank are good resources.
• For aptitude prep, you can refer to IndiaBix and Practice Aptitude.
• For data science challenges, practice well on GLabs and Kaggle.
• Brilliant is an excellent resource for tricky math and statistics questions.
• For practising SQL, SQL Zoo and Mode Analytics are good resources that let you solve the exercises in the browser itself.
Things to Note:
• Ensure that you are calm and relaxed before you attempt the challenge.
• Read through all the questions before you start attempting them. Let your mind go into problem-solving mode before your fingers do!
• If you finish the test before time, recheck your answers and then submit.
• Sometimes these rounds don't go your way: you might have had a brain fade, or it was not your day. Don't worry! Shake it off, for there is always a next time and this is not the end of the world.


❓ Question 2: What is a z-score?
1. A standardized value that indicates the number of standard deviations an observation is from the mean.
2. The range between the highest and lowest values in a set of data.
3. A measure of the spread of a set of data.
4. A measure of central tendency of a set of data.
✅ Correct Response: 1
Explanation: In statistics, a z-score is a standardized value that indicates the number of standard deviations an observation is from the mean of a set of data. It is used to compare values from different normal distributions and to calculate probabilities.
https://t.me/DataScienceInterviews
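As a rough illustration of the z-score definition above, here is a minimal sketch using only the Python standard library. The function name `z_scores` and the sample data are made up for this example, and the population standard deviation is assumed.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Return (x - mean) / standard deviation for every value."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    return [(x - mu) / sigma for x in values]


scores = [2, 4, 4, 4, 5, 5, 7, 9]  # made-up data: mean 5, standard deviation 2
print(z_scores(scores))  # [-1.5, -0.5, -0.5, -0.5, 0.0, 0.0, 1.0, 2.0]
```

A value of 9 sits two standard deviations above the mean, so its z-score is 2.0.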


❓ Question 1: What is the difference between a population and a sample in statistics?
1. A population is a subset of a sample.
2. A sample is a subset of a population.
3. A population is a larger group, while a sample is a smaller group.
4. A sample is a group that is more representative than a population.
✅ Correct Response: 2
Explanation: In statistics, a population is the entire group of individuals, objects, or events that we are interested in studying, while a sample is a smaller subset of the population that is selected for study. Samples are often used when it is not feasible or practical to study the entire population.
https://t.me/DataScienceInterviews


1. How will you handle missing values in data?
There are several ways to handle missing values in the given data:
• Dropping the values
• Deleting the observation (not always recommended)
• Replacing the value with the mean, median, or mode of the observed values
• Predicting the value with regression
• Finding an appropriate value with clustering
2. What is SVM? Can you name some kernels used in SVM?
SVM stands for support vector machine. SVMs are used for classification and prediction tasks. An SVM consists of a separating plane that discriminates between the two classes of variables. This separating plane is known as a hyperplane. Some of the kernels used in SVM are:
• Polynomial Kernel
• Gaussian Kernel
• Laplace RBF Kernel
• Sigmoid Kernel
• Hyperbolic Kernel
3. What is market basket analysis?
Market Basket Analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
4. What is the benefit of batch normalization?
• The model is less sensitive to hyperparameter tuning.
• High learning rates become acceptable, which results in faster training of the model.
• Weight initialization becomes an easy task.
• Using different non-linear activation functions becomes feasible.
• Deep neural networks are simplified because of batch normalization.
• It introduces mild regularisation in the network.
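The "replace with the mean or median" option from question 1 above can be sketched in a few lines of plain Python. This is only an illustration: the `impute` helper and the `ages` column are hypothetical, and missing entries are assumed to be represented as `None`.

```python
from statistics import mean, median


def impute(values, strategy="mean"):
    """Replace None entries with the mean or median of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed) if strategy == "mean" else median(observed)
    return [fill if v is None else v for v in values]


ages = [25, None, 31, 40, None, 24]  # hypothetical column with two gaps
print(impute(ages, "mean"))
print(impute(ages, "median"))
```

The median variant is usually the safer choice when the column contains extreme values, because the mean is pulled toward them.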


1. Explain character-manipulation functions and their different types in SQL.
Character-manipulation functions change, extract, and edit character strings. Each function acts on the input strings it is given (one or more characters or words) and returns the result. The character manipulation functions in SQL are as follows:
A) CONCAT: joins two or more values together. The second string is always appended to the end of the first string.
B) SUBSTR: returns a segment of a string from a given start point to a given endpoint.
C) LENGTH: returns the length of the string as a number, including blank spaces.
D) INSTR: returns the numeric position of a character or word in a string.
E) LPAD: pads the left side of a character value, producing a right-justified value.
F) RPAD: pads the right side of a character value, producing a left-justified value.
G) TRIM: removes all defined characters from the beginning, end, or both ends of a string, which also cuts down on wasted space.
H) REPLACE: replaces all instances of a word or substring with the other string value specified.
2. How Do You Calculate the Daily Profit Measures Using LOD?
LOD expressions allow us to easily create bins on aggregated data such as profit per day.
Scenario: We want to measure our success by the total profit per business day.
Create a calculated field named LOD - Profit per day and enter the formula:
{ FIXED [Order Date] : SUM([Profit]) }
Create another calculated field named LOD - Daily Profit KPI and enter the formula:
IF [LOD - Profit per day] > 2000 THEN "Highly Profitable"
ELSEIF [LOD - Profit per day] <= 0 THEN "Unprofitable"
ELSE "Profitable"
END
To calculate the daily profit measure using LOD, follow these steps to draw the visualization:
• Bring YEAR(Order Date) and MONTH(Order Date) to the Columns shelf.
• Drag the Order Id field to the Rows shelf. Right-click on it, select Measure, and click on Count(Distinct).
• Drag LOD - Daily Profit KPI to the Rows shelf.
• Bring LOD - Daily Profit KPI to the Marks card and change the mark type from Automatic to Area.
3. What are Superkey and candidate key?
A super key is a single attribute or a combination of attributes that identifies a record in a table. A super key can have one or more attributes, even though not all of them may be necessary to identify the records. A candidate key is a minimal super key: it has one or more attributes, and every one of them is needed to identify the records. All candidate keys are super keys, but not all super keys are candidate keys.
4. What is Database Cardinality?
Database cardinality denotes the uniqueness of values in the tables. It supports optimizing query plans and hence improves query performance. There are three types of database cardinalities in SQL:
• Higher Cardinality
• Normal Cardinality
• Lower Cardinality
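Several of the character functions from question 1 above can be tried out with Python's built-in `sqlite3` module. This is a sketch under one assumption: SQLite is used as the engine, so concatenation is written with `||` rather than CONCAT, and LPAD/RPAD are left out because SQLite does not provide them. The sample strings are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # throwaway in-memory database
row = conn.execute(
    """
    SELECT 'Data' || ' ' || 'Science',                  -- concatenation
           substr('Data Science', 6, 7),                -- 7 characters from position 6
           length('Data Science'),                      -- the space is counted too
           instr('Data Science', 'Sci'),                -- 1-based position of the match
           trim('  padded  '),                          -- strips spaces from both ends
           replace('Data Science', 'Data', 'Computer')  -- swaps every occurrence
    """
).fetchone()
conn.close()
print(row)
```

Other engines such as Oracle or MySQL spell some of these functions differently, so the exact syntax should be checked against the dialect in use.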



👉✔️ Here are Data Analytics-related questions along with their answers:
1. Question: What is the purpose of exploratory data analysis (EDA)?
Answer: EDA is used to analyze and summarize data sets, often through visual methods, to understand patterns, relationships, and potential outliers.
2. Question: What is the difference between supervised and unsupervised learning?
Answer: Supervised learning involves training a model on a labeled dataset, while unsupervised learning deals with unlabeled data to discover patterns without explicit guidance.
3. Question: Explain the concept of normalization in the context of data preprocessing.
Answer: Normalization scales numeric features to a standard range, preventing certain features from dominating due to their larger scales.
4. Question: What is the purpose of a correlation coefficient in statistics?
Answer: A correlation coefficient measures the strength and direction of a linear relationship between two variables, ranging from -1 to 1.
5. Question: What is the role of a decision tree in machine learning?
Answer: A decision tree is a predictive model that maps features to outcomes by recursively splitting data based on feature conditions.
6. Question: Define precision and recall in the context of classification models.
Answer: Precision is the ratio of correctly predicted positive observations to the total predicted positives, while recall is the ratio of correctly predicted positive observations to all actual positives.
7. Question: What is the purpose of cross-validation in machine learning?
Answer: Cross-validation assesses a model's performance by dividing the dataset into multiple subsets, training the model on some, and testing it on others, helping to evaluate its generalization ability.
8. Question: Explain the concept of a data warehouse.
Answer: A data warehouse is a centralized repository that stores, integrates, and manages large volumes of data from different sources, providing a unified view for analysis and reporting.
9. Question: What is the difference between structured and unstructured data?
Answer: Structured data is organized and easily searchable (e.g., databases), while unstructured data lacks a predefined structure (e.g., text documents, images).
10. Question: What is clustering in machine learning?
Answer: Clustering is a technique that groups similar data points together based on certain features, helping to identify patterns or relationships within the data.
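To make the correlation coefficient from question 4 above concrete, here is a small sketch of the Pearson coefficient in plain Python. The function name `pearson_r` and the three data series are invented for illustration.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


hours = [1, 2, 3, 4, 5]        # made-up data
rising = [2, 4, 6, 8, 10]      # grows in lockstep with hours
falling = [10, 8, 6, 4, 2]     # shrinks in lockstep with hours
print(pearson_r(hours, rising), pearson_r(hours, falling))
```

A perfectly increasing linear relationship gives a value at the top of the range, and a perfectly decreasing one gives a value at the bottom.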


1. What are the common problems that data analysts encounter during analysis?
The common problems encountered in any analytics project are:
• Handling duplicate data
• Collecting the right, meaningful data at the right time
• Handling data purging and storage problems
• Making data secure and dealing with compliance issues
2. Explain the Type I and Type II errors in Statistics?
In hypothesis testing, a Type I error occurs when the null hypothesis is rejected even though it is true. It is also known as a false positive. A Type II error occurs when the null hypothesis is not rejected even though it is false. It is also known as a false negative.
3. What's the F1 score? How would you use it?
The F1 score is a measure of a model's performance. It is the harmonic mean of the precision and recall of a model, with results tending to 1 being the best, and those tending to 0 being the worst.
4. Name an example where ensemble techniques might be useful?
Ensemble techniques use a combination of learning algorithms to obtain better predictive performance. They typically reduce overfitting in models and make the model more robust (unlikely to be influenced by small changes in the training data). You could list some examples of ensemble methods (bagging, boosting, the "bucket of models" method) and demonstrate how they could increase predictive power.
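The F1 score from question 3 above can be computed directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels encoded as 0 and 1; the helper name `precision_recall_f1` and the label lists are hypothetical.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, F1) for binary labels encoded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives (Type I)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives (Type II)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # made-up ground truth
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # made-up predictions
print(precision_recall_f1(y_true, y_pred))
```

With these labels precision is 2/3 and recall is 1/2, and F1 comes out at 4/7, which is below their arithmetic mean. That pull toward the weaker of the two values is why F1 is useful when a model must do well on both.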


1. How can you assess a good logistic model?
A. One approach to determining the goodness of fit is the Hosmer-Lemeshow statistic, which is computed on data after the observations have been segmented into groups based on having similar predicted probabilities. It examines whether the observed proportions of events are similar to the predicted probabilities of occurrence in subgroups of the data set using a Pearson chi-square test. Small values with large p-values indicate a good fit to the data, while large values with p-values below 0.05 indicate a poor fit. The null hypothesis holds that the model fits the data.
2. What is the bias-variance trade-off?
A. Bias is the difference between the average prediction of our model and the correct value which we are trying to predict. Variance is the variability of the model prediction for a given data point, which tells us the spread of our predictions. If our model is too simple and has very few parameters, it may have high bias and low variance. On the other hand, if our model has a large number of parameters, it is going to have high variance and low bias. So we need to find the right balance without overfitting or underfitting the data. This tradeoff in complexity is why there is a tradeoff between bias and variance: an algorithm can't be more complex and less complex at the same time.
3. Why is mean square error a bad measure of model performance?
A. A disadvantage of the mean-squared error is that it is not very interpretable, because MSEs vary depending on the prediction task and thus cannot be compared across different tasks. Assume, for example, that one prediction task is concerned with estimating the weight of trucks and another is concerned with estimating the weight of apples. Then, in the first task, a good model may have an RMSE of 100 kg, while a good model for the second task may have an RMSE of 0.5 kg. Therefore, while RMSE is viable for model selection, it is rarely reported and R² is used instead.
4. How can the outlier values be treated?
A. Below are some of the methods of treating outliers:
• Trimming/removing the outlier: In this technique, we remove the outliers from the dataset.
• Quantile-based flooring and capping: In this technique, the outlier is capped at a certain value above the 90th percentile value or floored at a factor below the 10th percentile value.
• Mean/median imputation: As the mean value is highly influenced by the outliers, it is advised to replace the outliers with the median value.
5. What is a confusion matrix?
A. A confusion matrix is a method of summarising a classification algorithm's performance. Calculating a confusion matrix can help you understand what your classification model is getting right and where it is going wrong. It gives us:
• "true positive" for correctly predicted event values,
• "false positive" for incorrectly predicted event values,
• "true negative" for correctly predicted no-event values,
• "false negative" for incorrectly predicted no-event values.
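The quantile-based flooring and capping idea from question 4 above might look like the following sketch. It is a simplification: values are clipped exactly at the 10th and 90th percentiles rather than at some factor beyond them, and the helper name `cap_outliers` and the data are invented for the example.

```python
from statistics import quantiles


def cap_outliers(values):
    """Clip values to the [10th percentile, 90th percentile] range."""
    deciles = quantiles(values, n=10)   # nine cut points: 10%, 20%, ..., 90%
    low, high = deciles[0], deciles[-1]
    return [min(max(v, low), high) for v in values]


data = [12, 14, 15, 15, 16, 17, 18, 19, 20, 95]  # 95 is an obvious outlier
capped = cap_outliers(data)
print(capped)
```

Unlike trimming, this keeps the number of observations unchanged, which matters when rows are scarce or when other columns of the same row are still useful.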


Data science interview questions: JPMorgan Chase


1. What is the difference between Deep Learning and Machine Learning?
Deep Learning allows machines to make various business-related decisions using artificial neural networks that simulate the human brain, which is one of the reasons why it needs a vast amount of data for training. Machine Learning gives machines the ability to make business decisions without any external help, using the knowledge gained from past data. Machine Learning systems require relatively small amounts of data to train themselves, and most of the features need to be manually coded and understood in advance.
2. What is Cross-validation in Machine Learning?
Cross-validation lets you estimate how well a given Machine Learning algorithm will perform on unseen data. The dataset is broken into smaller parts that have the same number of rows, out of which a random part is selected as a test set and the rest of the parts are kept as train sets. Cross-validation consists of the following techniques:
• Holdout method
• K-fold cross-validation
• Stratified k-fold cross-validation
• Leave p-out cross-validation
3. What is Epoch in Machine Learning?
An epoch is one complete pass of the Machine Learning algorithm over the training dataset, so the number of epochs counts those passes. Generally, when there is a large chunk of data, it is grouped into several batches. Each batch that goes through the model counts as one iteration. If the batch size comprises the complete training dataset, then the count of iterations is the same as that of epochs.
4. What is Dimensionality Reduction?
In the real world, Machine Learning models are built on top of features and parameters. These features can be multidimensional and large in number. Sometimes, the features may be irrelevant, and it becomes a difficult task to visualize them. This is where dimensionality reduction is used to cut down irrelevant and redundant features with the help of principal variables. These principal variables are a subgroup of the parent variables and preserve their information.
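As an illustration of k-fold cross-validation from question 2 above, the sketch below only generates the train/test index splits; no model is trained. The generator name `k_fold_indices` is hypothetical, and contiguous, unshuffled folds are assumed for simplicity.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(k_fold_indices(10, 5))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every row lands in exactly one test fold, so each observation is used for evaluation once and for training in the remaining rounds. In practice the rows are usually shuffled first.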


1. What are the uses of RNNs in NLP?
The RNN is a stateful neural network, which means that it retains information not only from the previous layer but also from the previous pass. Thus, its neurons are said to have connections between passes, and through time. Because the RNN is stateful, the order of the input matters: the same words in a different order will yield different outputs. RNNs can be used for unsegmented, connected applications such as handwriting recognition or speech recognition.
2. How to remove values from a Python array?
Ans: Array elements can be removed using the pop() or remove() method. The difference between these two functions is that the former returns the deleted value whereas the latter does not.
3. What are the advantages and disadvantages of views in the database?
Answer:
Advantages of Views:
• As there is no physical location where the data in the view is stored, it generates output without wasting resources.
• Data access is restricted, as it does not allow commands like insertion, update, and deletion.
Disadvantages of Views:
• The view becomes irrelevant if we drop a table related to that view.
• Much memory space is occupied when the view is created for large tables.
4. Describe the Difference Between Window Functions and Aggregate Functions in SQL.
The main difference between window functions and aggregate functions is that aggregate functions group multiple rows into a single result row; all the individual rows in the group are collapsed and their individual data is not shown. On the other hand, window functions produce a result for each individual row. This result is usually shown as a new column value in every row within the window.
5. What is Ribbon in Excel and where does it appear?
The Ribbon is basically your key interface with Excel, and it appears at the top of the Excel window. It allows users to access many of the most important commands directly. It consists of many tabs such as File, Home, View, Insert, etc. You can also customize the Ribbon to suit your preferences. To customize the Ribbon, right-click on it and select the "Customize the Ribbon" option.
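For question 2 above, the difference between pop() and remove() is easy to see on a small list. The sketch uses a plain Python list with made-up values; objects from the standard `array` module offer the same two methods.

```python
numbers = [10, 20, 30, 20]

removed = numbers.pop(1)     # removes the item at index 1 and returns it
result = numbers.remove(20)  # removes the first item equal to 20, returns None

print(removed, result, numbers)  # 20 None [10, 30]
```

Note that pop() takes a position while remove() takes a value, so the two calls above delete different elements even though both happen to hold 20.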


Data Science Interview Questions asked in Verizon
1. How many cars are there in Chennai? How do you structurally approach coming up with that number?
2. Multiple Linear Regression?
3. OLS vs MLE?
4. R2 vs Adjusted R2? During model development, which one do we consider?
5. Lift chart, drift chart
6. Sigmoid function in logistic regression
7. ROC: what is it? AUC and differentiation?
8. Linear Regression from Multiple Linear Regression
9. P-value: what is it and what is its significance? What does P in P-value stand for? What is hypothesis testing? Null hypothesis vs alternate hypothesis?
10. Bias-variance trade-off?
11. Overfitting vs underfitting in machine learning?
12. Estimation of Multiple Linear Regression
13. Forecasting vs prediction difference? Regression vs time series?
14. p, d, q values in ARIMA models
    • What will happen if d=0?
    • What is the meaning of the p, d, q values?
15. Is your data for forecasting uni- or multi-dimensional?
16. How to find the node to start with in a decision tree.
17. Types of decision trees: CART vs C4.5 vs ID3
18. Gini index vs entropy
19. Linear vs Logistic Regression
20. Decision Trees vs Random Forests
21. Questions on linear regression and how it works
22. Asked to write some SQL queries
23. Asked about past work experience
24. Some questions on inferential statistics (hypothesis testing, sampling techniques)
25. Some questions on table (how to filter, how to add calculated fields, etc.)
26. Why do you use a licensed platform when other open source packages are available?
27. What certifications have you done?
28. What is a Confidence Interval?
29. What are outliers? How to detect outliers?
30. How to handle outliers?


1. What is Cross-validation in Machine Learning?
Cross-validation is a resampling method that uses different portions of the data to test and train a model on different iterations. It is mainly used in settings where the goal is prediction, and one wants to estimate how accurately a predictive model will perform in practice. The dataset is broken into smaller parts that have the same number of rows, out of which a random part is selected as a test set and the rest of the parts are kept as train sets. Cross-validation consists of the following techniques:
• Holdout method
• K-fold cross-validation
• Stratified k-fold cross-validation
• Leave p-out cross-validation
2. What is bagging and boosting in Machine Learning?
Bagging uses homogeneous weak learners that are trained independently of each other in parallel and then combined by averaging their predictions. Boosting also uses homogeneous weak learners but works differently from bagging: the learners are trained sequentially and adaptively, so that each one improves on the predictions of the previous ones.
3. What is systematic sampling and cluster sampling?
Systematic sampling is a type of probability sampling method. The sample members are selected from a larger population with a random starting point but a fixed periodic interval. This interval is known as the sampling interval. The sampling interval is calculated by dividing the population size by the desired sample size.
Cluster sampling involves dividing the population into separate groups, called clusters. Then, a simple random sample of clusters is selected from the population. Analysis is conducted on data from the sampled clusters.
4. What is market basket analysis?
Market Basket Analysis is a modeling technique based upon the theory that if you buy a certain group of items, you are more (or less) likely to buy another group of items.
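Systematic sampling, as described in question 3 above, reduces to a random start plus a fixed step. The following sketch assumes the population size is at least the sample size; the function name `systematic_sample`, the `seed` parameter, and the numbered population are all made up for the example.

```python
import random


def systematic_sample(population, sample_size, seed=None):
    """Pick every interval-th member after a random starting point."""
    interval = len(population) // sample_size  # the sampling interval
    start = random.Random(seed).randrange(interval)
    return population[start::interval][:sample_size]


population = list(range(1, 101))  # hypothetical population of 100 numbered units
sample = systematic_sample(population, 10, seed=42)
print(sample)
```

With 100 units and a sample of 10, the interval is 10, so consecutive sample members are always exactly 10 positions apart whatever the starting point.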


1. Explain some cases where k-Means clustering fails to give good results.
k-means has trouble clustering data where clusters are of various sizes and densities. Outliers will cause the centroids to be dragged, or the outliers might get their own cluster instead of being ignored. Outliers should be clipped or removed before clustering. If the number of dimensions increases, a distance-based similarity measure converges to a constant value between any given examples. Dimensions should be reduced before clustering.
2. If your Time-Series Dataset is very long, what architecture would you use?
If the dataset for time-series is very long, LSTMs are ideal for it because they can process not only single data points but also entire sequences of data. A time-series being a sequence of data makes the LSTM ideal for it. For an even stronger representational capacity, making the LSTMs multi-layered is better. Another method for long time-series datasets is to use CNNs to extract information.
3. How would you define Power BI as an effective solution?
Power BI is a strong business analytics tool that creates useful insights and reports by collating data from unrelated sources. This data can be extracted from any source, like Microsoft Excel or hybrid data warehouses. Power BI drives an extreme level of utility and purpose using an interactive graphical interface and visualizations.
4. Why is the KNN Algorithm known as Lazy Learner?
When the KNN algorithm gets the training data, it does not learn and make a model, it just stores the data. Instead of finding any discriminative function with the help of the training data, it follows instance-based learning and uses the training data only when it actually needs to make a prediction on unseen data. As a result, KNN does not immediately learn a model but rather delays the learning, which is why it is referred to as a Lazy Learner.
5. Explain the difference between drop and truncate.
In SQL, the DROP command is used to remove an entire database or table, including its indexes, data, and more, whereas the TRUNCATE command is used to remove all the rows from the table.
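The "lazy learner" behaviour from question 4 above shows up clearly in a toy implementation: fitting only stores the data, and all the distance work happens at prediction time. This is a sketch, not a production classifier; the class name `LazyKNN` and the two-cluster data are invented, and `math.dist` requires Python 3.8 or later.

```python
from collections import Counter
from math import dist


class LazyKNN:
    """Toy k-nearest-neighbours classifier using Euclidean distance."""

    def __init__(self, k=3):
        self.k = k

    def fit(self, points, labels):
        # No model is built here: the training data is simply stored.
        self.points, self.labels = list(points), list(labels)
        return self

    def predict(self, query):
        # All the work is deferred to prediction time.
        ranked = sorted(zip(self.points, self.labels),
                        key=lambda pair: dist(pair[0], query))
        votes = Counter(label for _, label in ranked[: self.k])
        return votes.most_common(1)[0][0]


points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]  # made-up clusters
labels = ["a", "a", "a", "b", "b", "b"]
model = LazyKNN(k=3).fit(points, labels)
print(model.predict((2, 2)), model.predict((7, 9)))
```

The trade-off is that training is essentially free while every prediction scans the stored data, which becomes slow for large training sets.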


Hey 👋 Here you can access Data Science Interview Preparation Books ❤️‍🔥👇 https://dataanalysts.gumroad.com/l/datascienceinterview/data?a=363448787 ◾How to get it: 1. Click on the link 2. Enter the amount you like [Can be 0 as well :) ] 3. Click the 'I Want This' Button 4. Enter your email and get it delivered! I'd appreciate it if you could give it a 5 star when you download it. Join for more: https://t.me/DataScienceFree Thanks 😊


1. How is the Error calculated in a Linear Regression model?
• Measure the distance of the observed y-values from the predicted y-values at each value of x.
• Square each of these distances.
• Calculate the mean of the squared distances.
MSE = (1/n) * Σ(actual - forecast)^2
The smaller the Mean Squared Error, the closer you are to finding the line of best fit. How bad or good this final value is always depends on the context of the problem, but the main goal is to make it as small as possible.
2. Explain the intuition behind the Gradient Descent algorithm.
Gradient descent is an optimization algorithm that is used when training a machine learning model. It is based on a convex function and tweaks its parameters iteratively to minimize a given function to its local minimum (that is, slope = 0). To start, we select a random bias and weights, and then iterate over the slope function until we get a slope of 0. How much we update the values of the bias and weights at each step is controlled by a variable called the learning rate. We have to choose the learning rate wisely:
• A small learning rate may cause the model to take a long time to learn.
• A large learning rate may prevent the model from converging, as our pointer will overshoot and never reach the minimum.
3. How is a Random Forest related to Decision Trees?
Random forest is an ensemble learning method that works by constructing a multitude of decision trees. A random forest can be constructed for both classification and regression tasks. Random forest outperforms decision trees, and it is less prone to overfitting the data than decision trees are. A decision tree trained on a specific dataset will become very deep and cause overfitting. To create a random forest, decision trees can be trained on different subsets of the training dataset, and then the different decision trees can be averaged with the goal of decreasing the variance.
4. What are some disadvantages of using Naive Bayes Algorithm?
Some disadvantages of using the Naive Bayes Algorithm are:
• It relies on a very big assumption that the independent variables are not related to each other.
• It is generally not suitable for datasets with large numbers of numerical attributes.
• It has been observed that if a rare case is not in the training dataset but is in the testing dataset, then it will most definitely be wrong.
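Questions 1 and 2 above fit together in a short sketch: gradient descent repeatedly nudges a weight and a bias in the direction that lowers the mean squared error. Several choices here are assumptions made for the example: a single input feature, parameters starting at zero instead of random values, a fixed number of steps, and the function names and data themselves.

```python
def mse(xs, ys, w, b):
    """Mean squared error of the line y = w * x + b."""
    return sum((y - (w * x + b)) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, learning_rate=0.05, steps=2000):
    """Fit w and b by following the negative gradient of the MSE."""
    w, b = 0.0, 0.0  # zero start for reproducibility (the text suggests random)
    n = len(xs)
    for _ in range(steps):
        errors = [y - (w * x + b) for x, y in zip(xs, ys)]
        grad_w = (-2 / n) * sum(x * e for x, e in zip(xs, errors))
        grad_b = (-2 / n) * sum(errors)
        w -= learning_rate * grad_w  # step size is set by the learning rate
        b -= learning_rate * grad_b
    return w, b


xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]  # made-up points lying exactly on y = 2x + 1
w, b = gradient_descent(xs, ys)
print(round(w, 4), round(b, 4), mse(xs, ys, w, b))
```

On this data the fit approaches a slope of 2 and an intercept of 1. Raising `learning_rate` well above roughly 0.15 makes the updates overshoot and diverge for these particular points, which is the failure mode described in question 2.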


Credits: 365DataScience