Whether you are a fresher or an experienced professional in transition, the interview for your next data science job can be daunting. What will they ask? How difficult will the questions be? How should you answer them? If the thought of data science interview questions is giving you sleepless nights, let us help you. In this blog post, Mitesh Gupta, a senior data scientist and Springboard mentor, lists the top 15 questions you might get asked at interviews for data science jobs and suggests the best ways to answer them. This is the second in the series of the most asked data science interview questions in India, as shared by real-life data scientists. Don’t forget to check out the first set of interview questions in our previous blog.
Data Science Interview Questions and Answers
As an emerging field, data science is opening up thousands of data science jobs across the country. Naturally, there are not enough qualified candidates for these positions. Fresh graduates and early career professionals make up a vast majority of candidates. If you are one of them, remember that your interviewer knows that you’re not an expert. Unless you’re applying for leadership positions, they simply expect you to have a strong foundational understanding. As long as you can demonstrate analytical thinking and a robust approach to solving problems, you are all set. The answer to ‘how to crack a data science interview’ is: Know the pros and cons of each data science approach to problem solving, and show how your approach is the best for the problem at hand. Let’s get to the top 15 data science interview questions now.
#1: What are the different types of regression?
Regression is one of the first models any aspiring data scientist will put into practice. In simple terms, regression models the relationship between an input variable (also known as a predictor) and an output variable (the response). There are several types of regression models. Let’s look at the seven most commonly used.
- Linear regression fits a straight line to the relationship between input and output variables. It is typically applicable to continuous output variables.
- Logistic regression is used to calculate the probability of an event. It is typically used for classification problems, where the output variable is discrete.
- Polynomial regression fits a non-linear relationship between input and output variables. It does this by adding higher powers of the input variable, so the model fits a curve instead of a straight line.
- Stepwise regression is used when you have many candidate input variables and want the selection of independent variables to be automatic. The procedure adds or removes variables step by step on the basis of a metric such as R-Squared or the Akaike information criterion (AIC).
- Ridge regression is used when there is either high variance or multicollinearity. It uses L2 regularisation, which penalises the squared size of the coefficients, to minimise error.
- Lasso regression is similar to ridge regression, but it uses L1 regularisation: the penalty is the sum of the absolute values of the coefficients.
- ElasticNet regression is a hybrid of the ridge and lasso methods. It combines L1 and L2 regularisation and is especially useful for data with many variables.
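To make the difference between the L2 and L1 penalties concrete, here is a minimal sketch for the simplest possible case: one centred input variable and no intercept. The function names, the data and the penalty strength `lam` are illustrative assumptions, and the closed-form solutions apply only to this one-variable setting with the objectives stated in the docstrings.

```python
def _sums(xs, ys):
    """Return (sum of x*y, sum of x*x) for centred data."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy, sxx

def ols_slope(xs, ys):
    """Minimises sum((y - w*x)^2): plain least squares."""
    sxy, sxx = _sums(xs, ys)
    return sxy / sxx

def ridge_slope(xs, ys, lam):
    """Minimises sum((y - w*x)^2) + lam * w^2 (L2 penalty)."""
    sxy, sxx = _sums(xs, ys)
    return sxy / (sxx + lam)

def lasso_slope(xs, ys, lam):
    """Minimises sum((y - w*x)^2) + lam * |w| (L1 penalty, soft-thresholding)."""
    sxy, sxx = _sums(xs, ys)
    shrunk = max(abs(sxy) - lam / 2, 0.0)
    sign = 1.0 if sxy >= 0 else -1.0
    return sign * shrunk / sxx

# Hypothetical centred data, roughly y = 2x.
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
ys = [-4.1, -1.9, 0.0, 2.1, 3.9]

w_ols = ols_slope(xs, ys)
w_ridge = ridge_slope(xs, ys, lam=10.0)
w_lasso = lasso_slope(xs, ys, lam=10.0)
w_lasso_strong = lasso_slope(xs, ys, lam=50.0)
```

In this sketch, ridge shrinks the slope towards zero but never reaches it, while a strong enough lasso penalty sets the slope to exactly zero. That is why lasso is often described as doing variable selection.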
You can watch Mitesh explain these concepts in detail in this video above.
#2: What are the assumptions of linear regression and how to verify them?
While applying any algorithm to a dataset, you need to understand the assumptions associated with it, and the corresponding pros and cons. The assumptions of basic linear regression, which you need to check as part of data exploration, are:
- There should be a linear relationship between input and output variables, i.e. the data should be well described by a straight line. You can verify this using a residual plot.
- There should be no multicollinearity, i.e. input variables should not have a moderate or high correlation between them. Multicollinearity inflates the standard errors of the coefficients, which results in wide confidence intervals. You can verify this using the VIF (variance inflation factor) test.
- There should not be any autocorrelation in the data, i.e. no correlation between error terms. Autocorrelation commonly occurs in time-series data, where data for one period might depend on the data for the previous period. You can verify this with residual vs. time plots or the Durbin–Watson statistic.
- There should be homoscedasticity, i.e. the variance of the residuals should be constant. Violations of this assumption (heteroscedasticity) typically occur due to outliers in the data. You can verify this using the residual vs. fitted plot. If the plot has a funnel shape, the variance is not constant, and you need to deal with (for example, remove) the outliers.
- There should be multivariate normality, i.e. the residuals should be normally distributed. You can check this with a Q-Q plot of the residuals.
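As an illustration of the autocorrelation check, the Durbin–Watson statistic can be computed directly from the residuals. This is a minimal sketch; the function name and both residual series are made up for the example. Values near 2 suggest no autocorrelation, values well below 2 suggest positive autocorrelation, and values well above 2 suggest negative autocorrelation.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    num = sum(
        (residuals[i] - residuals[i - 1]) ** 2
        for i in range(1, len(residuals))
    )
    den = sum(e ** 2 for e in residuals)
    return num / den

# Hypothetical residuals, ordered in time.
alternating = [1.0, -1.0, 1.0, -1.0, 1.0, -1.0]  # sign flips every step
trending = [1.0, 0.9, 0.8, -0.8, -0.9, -1.0]     # neighbours look alike

dw_alternating = durbin_watson(alternating)  # well above 2
dw_trending = durbin_watson(trending)        # well below 2
```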
#3: How does a linear regression algorithm determine what are the best coefficient values?
The error term between predicted and actual values determines whether a regression line is the best fit or not. To measure it, we define a cost function. In linear regression, the cost function is the sum of squared errors. The coefficients that minimise the cost function give the best model.
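For a single input variable, the coefficients that minimise the sum of squared errors have a well-known closed form. The sketch below uses it on a small made-up dataset and checks that nudging either coefficient makes the cost worse. The function names and the data are illustrative assumptions.

```python
def fit_line(xs, ys):
    """Closed-form least-squares slope and intercept for one input variable."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def sse(xs, ys, slope, intercept):
    """Cost function: sum of squared errors between actual and predicted values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, roughly y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = fit_line(xs, ys)
best_cost = sse(xs, ys, slope, intercept)
cost_other_slope = sse(xs, ys, slope + 0.1, intercept)
cost_other_intercept = sse(xs, ys, slope, intercept + 0.1)
```

With more input variables, the same cost function is usually minimised with matrix algebra or gradient descent rather than this one-variable formula.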
#4: Data Science Interview Questions: What is R-Squared, its formula and definition?
R-Squared is an important metric in evaluating any regression problem. Simply put, it is the proportion of the variance in the response variable that is explained by the input variables. In other words, it is the percentage of variation explained by the model.
R-Squared = Variance explained by model / Total variance
R-Squared = 1 - (SSres / SStotal), where SSres is the sum of squared residuals and SStotal is the total sum of squares
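A small sketch of the calculation, taking R-Squared as 1 minus the ratio of the residual sum of squares to the total sum of squares. The function name and the numbers are made up for illustration.

```python
def r_squared(actual, predicted):
    """1 - (sum of squared residuals / total sum of squares)."""
    mean_y = sum(actual) / len(actual)
    ss_res = sum((y - p) ** 2 for y, p in zip(actual, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in actual)
    return 1 - ss_res / ss_total

actual = [3.0, 5.0, 7.0, 9.0]
perfect = [3.0, 5.0, 7.0, 9.0]    # predictions match exactly
mean_only = [6.0, 6.0, 6.0, 6.0]  # always predict the mean of actual

r2_perfect = r_squared(actual, perfect)      # 1.0
r2_mean_only = r_squared(actual, mean_only)  # 0.0
```

A model that explains everything scores 1, and a model that does no better than predicting the mean scores 0.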
#5: What problems arise if the distribution of new (unseen) test data is significantly different from the distribution of training data?
Misclassification will occur; predictions will be wrong. This happens for three main reasons:
- Selection bias: When data is static and sampling is not random
- Population drift: When data is non-static, and training is done on one type of population while testing is done on another
- Non-stationary environment: When you’re training in one type of environment and testing in another, for instance, training your model on solar power plant datasets from summer and testing it on data from the rainy season.
#6: What are bias and variance? How are they related to the modelling of data?
This is an important concept you need to understand to excel at data scientist jobs, because it is on the basis of bias and variance that we improve machine learning models. Bias is the difference between the average prediction of our model and the correct value. Variance is the variability of the model’s prediction for a given data point, a value which tells us the spread of the predictions. Bias and variance play a critical role in prediction errors.
High bias will hinder the model’s ability to learn from training data: the model makes overly simple assumptions and misses the underlying patterns. This results in underfitting. High variance will result in the model learning from the noise in the dataset as well. This results in overfitting: accuracy on the training data will be high, but when unseen data comes in during testing, the predictions will have high error rates. While modelling data, you need to find the optimum bias-variance tradeoff.
#7: Data Science Interview Questions: How does the decision tree work? How to split the nodes of a decision tree?
Decision trees are supervised learning models that work for both classification and regression problems. Accordingly, they come in two types: classification (categorical) trees and regression trees. At each node, we split the data on the feature and threshold that give the best possible split, measured using the Gini index or information gain.
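As a rough sketch of how a split is chosen with the Gini index, the code below tries every candidate threshold on one numeric feature and keeps the one with the lowest weighted impurity. The function names and the toy data are assumptions for illustration; real implementations repeat this search over all features and recurse on the resulting child nodes.

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_threshold(values, labels):
    """Return (threshold, weighted Gini) of the best 'value <= threshold' split."""
    best = (None, float("inf"))
    for t in sorted(set(values))[:-1]:  # the largest value would leave one side empty
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if weighted < best[1]:
            best = (t, weighted)
    return best

# Hypothetical data: customer age and whether they bought (1) or not (0).
ages = [22, 25, 30, 35, 40, 50]
bought = [0, 0, 0, 1, 1, 1]

threshold, impurity = best_threshold(ages, bought)  # splits cleanly at age <= 30
```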
#8: What error metric would you use to evaluate how good a binary classifier is? What if there is a class-imbalance problem?
When we deal with classification problems, we always see the results on the basis of the confusion matrix.
| |Actual positive (1)|Actual negative (0)|
|Predicted positive (1)|True-positive|False-positive|
|Predicted negative (0)|False-negative|True-negative|
If you take a binary classifier scenario like the above,
Accuracy = (True-positive + True-negative) / total records
When there is a class-imbalance problem, you can’t go by accuracy alone, because a model that always predicts the majority class will still score high. Instead, you need to look at precision, recall, F1 score and ROC, as described below.
- Precision = True-positive / Predicted positive
- Recall = True-positive / Actual positive
- F1 score = 2 * (precision * recall) / (precision + recall) (This is the harmonic mean of precision and recall)
- ROC = Plot of the true-positive rate against the false-positive rate at different classification thresholds
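To see why accuracy can mislead on imbalanced data, here is a small sketch that computes the metrics from confusion-matrix counts. The function name and the counts are made up: 1,000 records, of which only 10 are actually positive.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return accuracy, precision, recall, f1

# Hypothetical imbalanced case: 10 actual positives, 990 actual negatives.
accuracy, precision, recall, f1 = classification_metrics(tp=2, fp=3, fn=8, tn=987)
# accuracy is 0.989, yet precision is 0.4 and recall only 0.2
```

The classifier looks excellent on accuracy but finds only 2 of the 10 positives, which is exactly what precision, recall and F1 reveal.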
#9: What is the difference between bagging and boosting models?
Bagging is an ensemble technique. In bagging, we train multiple models and combine their outputs to build a single prediction. Each model is trained on a bootstrap sample of the original dataset, i.e. a sample drawn with replacement. A random forest can be seen as a bagging model.
Boosting is a technique to convert weak learners into strong predictors. This is done by taking a subset of data, giving equal weightage to all records, fitting a base model, and predicting on it. This first model might have high errors. In the next model, we give higher weightage to the records the previous model got wrong, and we continue sequentially so that each model corrects the errors of the one before it.
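The bagging idea can be sketched in a few lines. This is a deliberately tiny illustration, not a practical model: the "base model" here is just the mean of each bootstrap sample, and the function names, data and seed are assumptions. In practice each base model would be something like a decision tree.

```python
import random

def bootstrap_sample(data, rng):
    """Draw len(data) items from data with replacement."""
    return [rng.choice(data) for _ in data]

def bagged_mean(data, n_models, seed=0):
    """Fit a trivial base model (the sample mean) on each bootstrap sample
    and aggregate the models by averaging their predictions."""
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_models):
        sample = bootstrap_sample(data, rng)
        predictions.append(sum(sample) / len(sample))
    return sum(predictions) / len(predictions)

# Hypothetical target values.
data = [4.0, 5.5, 6.1, 3.9, 5.0, 7.2, 4.4]
prediction = bagged_mean(data, n_models=50, seed=1)
```

Boosting differs in that the models are built one after another, with each round reweighting the records that the previous round got wrong, rather than being trained independently on resampled data.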
#10: What are the different types of sampling methods?
Different types of sampling methods are as follows:
- Bernoulli sampling
- Cluster sampling
- Systematic sampling
- Random sampling
- Stratified sampling
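A minimal sketch of the five methods listed above, using only Python’s standard library. The function names, the population of 100 integers, the grouping into clusters of ten, and the odd/even strata are all assumptions chosen to keep the example short.

```python
import random

def bernoulli_sample(population, p, rng):
    """Include each item independently with probability p."""
    return [item for item in population if rng.random() < p]

def cluster_sample(clusters, k, rng):
    """Pick k whole clusters at random and keep every item in them."""
    chosen = rng.sample(clusters, k)
    return [item for cluster in chosen for item in cluster]

def systematic_sample(population, step, start=0):
    """Take every step-th item, beginning at index start."""
    return population[start::step]

def simple_random_sample(population, k, rng):
    """Pick k distinct items, each equally likely."""
    return rng.sample(population, k)

def stratified_sample(population, key, fraction, rng):
    """Sample the same fraction from each stratum defined by key."""
    strata = {}
    for item in population:
        strata.setdefault(key(item), []).append(item)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))
        sample.extend(rng.sample(group, k))
    return sample

rng = random.Random(42)
population = list(range(100))
clusters = [population[i:i + 10] for i in range(0, 100, 10)]

bernoulli = bernoulli_sample(population, p=0.1, rng=rng)
clustered = cluster_sample(clusters, k=2, rng=rng)
systematic = systematic_sample(population, step=10, start=3)
simple = simple_random_sample(population, k=10, rng=rng)
stratified = stratified_sample(population, key=lambda x: x % 2, fraction=0.1, rng=rng)
```

Note that the Bernoulli sample has a random size, while the other four return a size fixed by their parameters.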
#11: What is cross-validation? What is n-fold cross-validation?
Cross-validation is a technique to check whether a model is overfitting. In this technique, you reserve part of the data as a validation dataset, used only for testing. N-fold cross-validation divides the data into ‘n’ subsets. We run ‘n’ iterations, each time using one of the subsets as the testing dataset and the remaining subsets for training. We then average the model’s performance across all iterations, which gives a more reliable estimate and helps detect overfitting.
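The splitting step of n-fold cross-validation can be sketched as follows. The function name is hypothetical, and the sketch splits the data in its existing order; in practice the data is usually shuffled first.

```python
def k_fold_indices(n_samples, n_folds):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // n_folds + (1 if i < n_samples % n_folds else 0)
        for i in range(n_folds)
    ]
    folds = []
    start = 0
    for size in fold_sizes:
        test_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, n_folds=3)
test_sizes = [len(test) for _, test in folds]  # [4, 3, 3]
```

Each record lands in the testing subset exactly once. To complete the procedure, you would train on `train_idx`, score on `test_idx` for every fold, and average the scores.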
#12: Data Science Interview Questions: What is cross-entropy?
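Cross-entropy is a concept from information theory, built on top of entropy, used to measure the difference between two probability distributions. In deep learning, it is used as the cost function for classification models: minimising it brings the predicted distribution closer to the true one and so improves the accuracy of the model.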
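A minimal sketch of the calculation for discrete distributions, using the natural logarithm. The function name, the small `eps` guard against log(0), and the example distributions are assumptions for illustration.

```python
import math

def cross_entropy(p, q, eps=1e-12):
    """Cross-entropy of predicted distribution q relative to true distribution p:
    -sum(p_i * log(q_i))."""
    return -sum(pi * math.log(max(qi, eps)) for pi, qi in zip(p, q))

# Hypothetical three-class example where the true class is the first one.
true_dist = [1.0, 0.0, 0.0]
confident = [0.9, 0.05, 0.05]  # prediction close to the truth
unsure = [0.4, 0.3, 0.3]       # prediction far from the truth

loss_confident = cross_entropy(true_dist, confident)
loss_unsure = cross_entropy(true_dist, unsure)
```

The prediction closer to the true distribution gets the lower loss, which is what makes cross-entropy usable as a cost function.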
#13: What is multicollinearity? How to fix it in regression?
Multicollinearity occurs when we have a moderate or high correlation between input variables. If there is high multicollinearity, the standard errors of the coefficients are inflated, so confidence intervals will be wide, and we will be unable to identify which predictor variables actually help to predict the response variable.
It can be detected using the VIF (Variance Inflation Factor) statistic and fixed by removing the offending variables. Calculate the VIF for each input variable, and remove variables with high VIF from the dataset.
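The VIF of an input variable is 1 / (1 - R²), where R² comes from regressing that variable on all the other input variables. The sketch below covers only the simplest case of two input variables, where that R² equals the squared Pearson correlation between them. The function names and the data are assumptions for illustration.

```python
def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def vif_two_predictors(x1, x2):
    """VIF = 1 / (1 - R^2); with only two predictors, R^2 is r squared."""
    r2 = pearson_r(x1, x2) ** 2
    return 1.0 / (1.0 - r2)

# Hypothetical input variables.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
nearly_same = [1.1, 2.0, 2.9, 4.2, 5.1]  # almost a copy of x1
unrelated = [2.0, 5.0, 1.0, 5.0, 2.0]    # uncorrelated with x1

vif_high = vif_two_predictors(x1, nearly_same)
vif_low = vif_two_predictors(x1, unrelated)
```

A VIF of 1 means no inflation at all. A common rule of thumb treats values above roughly 5 to 10 as a sign of problematic multicollinearity.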
#14: What is the relationship between sample size and margin of error?
Sample size and margin of error have an inverse relationship, i.e. if the sample size increases, the margin of error decreases. However, the margin of error shrinks only in proportion to the square root of the sample size, so beyond a certain point, adding more samples brings little further improvement.
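For the margin of error of a sample mean, the usual formula is z * standard deviation / sqrt(n). The sketch below assumes a 95% confidence level (z of about 1.96) and a made-up standard deviation of 10.

```python
import math

def margin_of_error(std_dev, n, z=1.96):
    """Margin of error for a sample mean at the confidence level implied by z."""
    return z * std_dev / math.sqrt(n)

moe_100 = margin_of_error(std_dev=10, n=100)  # about 1.96
moe_400 = margin_of_error(std_dev=10, n=400)  # about 0.98
```

Quadrupling the sample size only halves the margin of error, which is why the gains from collecting more data flatten out.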
#15: Data Science Interview Questions: What are type I and II errors?
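An easy question to finish: a Type I error is a false positive (rejecting a null hypothesis that is actually true) and a Type II error is a false negative (failing to reject a null hypothesis that is actually false).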
If you’ve read this far, you must definitely be excited by the possibilities of a data science career. Have you checked out the job interview tips by data scientist and Springboard mentor Chirasmita Mallick? If you’re looking to accelerate your transition to a data science job, do consider Springboard’s online program. With 1:1 mentorship, career coaching and a job guarantee, it is one of the best data science courses in India. Check it out now!