Data science interviews can be some of the most demanding conversations of your career. A wide range of questions might be asked, and the answers can be tricky if you don’t understand the questions clearly. This is why we regularly bring in Springboard mentors to identify and share the top data science interview questions and answers. In our series of live online sessions with Springboard mentors, we answer interview questions, offer career coaching and deliver hands-on programming sessions on data science, machine learning and artificial intelligence topics. To watch previous sessions or be notified of upcoming ones, subscribe to the Springboard India YouTube channel.

Top Data Science Interview Questions and Answers

In this blog post, Arihant Jain, Lead Data Scientist at ZestMoney, discusses fifteen questions you might be asked in a data science interview. This is the third post in our series of the most asked data science interview questions and answers in India, as shared by real-life data scientists. You can read part one with Chirasmita Mallick, part two with Mitesh Gupta and the most asked data science interview questions in India on the Springboard blog. Let’s get started.

#1 What is the flow of supervised machine learning models? Explain with an example

This question is often asked to judge whether you have hands-on experience with machine learning models and how you would approach building one. This question, or a variation of it, is likely to come up in almost every data science job interview.

How to Prepare for a Data Science Interview

To get this kind of data science interview question right and set the stage for a meaningful conversation, prepare in advance. Before going into an interview, think through and write down all the steps you follow. Below is an example of a good flow.

  • Understanding the problem statement.
  • Exploring the metadata (how many variables are in the dataset, what those variables are, how they were sourced, etc.).
  • Loading the dataset. Depending on how big the data is, you can also talk about the tools and techniques you used.
  • Performing data cleaning and sanity checks: handling missing values with imputation methods, treating outliers, etc.
  • Building machine learning models. At this stage, mention 2-3 algorithms you have experience with. Start with simple ones like logistic regression, then move to more complex ones like random forest, naive Bayes, etc.
  • Closing with metrics: discussing how you measured the performance of your model and improved it.

The goal here is to demonstrate practical experience. Give examples from your own projects, even if they are personal projects or competitions. Also, show knowledge of the domain.
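To make this flow concrete, here is a minimal sketch in Python using pandas and scikit-learn. It assumes a hypothetical CSV file (customer_data.csv) with numeric features and a binary column named target; swap in your own data and column names.

```python
# A minimal supervised learning flow: load, explore, clean, train, evaluate.
# The file name and column names are hypothetical placeholders.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

# 1. Load the dataset and look at the metadata
df = pd.read_csv("customer_data.csv")          # hypothetical file
print(df.shape)
print(df.dtypes)

# 2. Basic cleaning: impute missing numeric values with the median
X = df.drop(columns=["target"])                # hypothetical target column
y = df["target"]
X = pd.DataFrame(SimpleImputer(strategy="median").fit_transform(X),
                 columns=X.columns)

# 3. Split the data, start with a simple model, then try a more complex one
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)
for model in [LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42)]:
    model.fit(X_train, y_train)
    print(type(model).__name__)
    # 4. Close with metrics: precision, recall, F1 per class
    print(classification_report(y_test, model.predict(X_test)))
```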

#2 What is the importance of statistics and mathematics in data science?

The interviewer doesn’t expect you to be a mathematician or statistician; they simply want to check whether you know the basics of statistics and mathematics in the context of data science. The best way to answer this is through examples.

You can say that the primary purpose of a machine learning model is to learn patterns from data, and many classical techniques make distributional assumptions about that data (for example, normally distributed errors in linear regression). To understand and manipulate data, one needs to be aware of concepts like mean, median, correlation, etc. Similarly, probability underpins logistic regression, derivatives drive optimisation, and so on, making maths and stats foundational in data science.
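For example, here is a quick sketch of how these basic statistics appear in everyday data work; the numbers below are made up purely for illustration.

```python
# Basic descriptive statistics with pandas (made-up numbers).
import pandas as pd

df = pd.DataFrame({"age": [23, 31, 45, 29, 52, 38],
                   "income": [30_000, 42_000, 80_000, 39_000, 95_000, 60_000]})

print(df["income"].mean())           # central tendency
print(df["income"].median())         # robust to extreme values
print(df["age"].corr(df["income"]))  # Pearson correlation between variables
```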

#3 What are the assumptions of logistic regression? How are they different from those of linear regression?

A key assumption of linear regression is that the dependent variable is linearly related to the independent variables, and that the error terms are normally distributed.

These assumptions don’t apply to logistic regression. Here, the dependent variable is categorical, and a linear relationship between the dependent and independent variables isn’t required. In fact, in logistic regression, the linear relationship is between the independent variables and the log-odds of the outcome.
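A small sketch, using made-up toy data, to show that what logistic regression models linearly is the log-odds rather than the response itself:

```python
# Logistic regression models log-odds as a linear function of the inputs.
# The toy data below is made up purely for illustration.
import numpy as np
from sklearn.linear_model import LogisticRegression

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)

# Recover the predicted probability for x = 3.5 from the linear log-odds
log_odds = model.intercept_[0] + model.coef_[0][0] * 3.5
prob = 1 / (1 + np.exp(-log_odds))
print(prob)
print(model.predict_proba([[3.5]])[0, 1])   # matches the value above
```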

#4 How can outlier values be treated?

An outlier is a data point that lies far away from the overall trend of the data. A common rule of thumb is to flag any value lying more than about 2.5 to 3 times the IQR (Interquartile Range) beyond the quartiles as an outlier. When left untreated, outliers affect the efficacy of algorithms like linear regression. Deleting outliers is a common approach, but it should be used only when no other option is plausible.

Flooring and capping methods, which replace outlier values with a chosen percentile value (for example, the 95th or 97th percentile), often work fine.
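A minimal sketch of percentile-based flooring and capping with pandas; the values are made up and the chosen percentiles are just one reasonable choice:

```python
# Flooring and capping outliers at the 5th and 95th percentiles (made-up data).
import pandas as pd

s = pd.Series([12, 15, 14, 13, 16, 15, 14, 400])   # 400 is an obvious outlier

lower, upper = s.quantile(0.05), s.quantile(0.95)
s_capped = s.clip(lower=lower, upper=upper)
print(s_capped.tolist())                            # the 400 is pulled back in
```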

#5 Let’s say you’re working on fraud identification and you have developed a very rich ML model. Your accuracy level is 98%. But your manager is unhappy with it. What went wrong and how can you fix it?

A fraud detection dataset is likely to be imbalanced: the target class (fraud) is very small in both volume and percentage. Therefore, accuracy doesn’t make sense as a metric. Instead, metrics like precision, recall, F1 score and the Area Under the Precision-Recall Curve (PR AUC) are more appropriate.
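A small illustration of why accuracy misleads on imbalanced data, assuming a made-up sample where only 5% of transactions are fraudulent:

```python
# Accuracy vs. precision/recall on an imbalanced, fraud-style dataset.
# Labels are made up: 1 = fraud (rare), 0 = legitimate.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, average_precision_score)

y_true = [0] * 95 + [1] * 5      # only 5% of cases are fraud
y_pred = [0] * 100               # a useless model that always predicts "no fraud"

print(accuracy_score(y_true, y_pred))                     # 0.95, looks great
print(precision_score(y_true, y_pred, zero_division=0))   # 0.0
print(recall_score(y_true, y_pred))                       # 0.0, catches no fraud
print(f1_score(y_true, y_pred, zero_division=0))          # 0.0

# PR AUC needs predicted scores rather than hard labels (scores are made up)
y_scores = [0.10] * 95 + [0.20] * 5
print(average_precision_score(y_true, y_scores))
```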

#6 Given 1000+ features in a dataset, what feature selection framework would you use to reduce the dimension of the dataset for a robust ML model?

It is possible to build a model with 1000 features. However, it would be an unnecessary use of time and resources to include variables that are not important with respect to the dependent variable. Therefore, data scientists use feature selection to pick only those variables that are meaningful.

One widely used technique is PCA (Principal Component Analysis), which constructs components that explain the maximum variance in the dataset, so that only the most informative ones need to be kept.
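A minimal sketch with scikit-learn, using random data purely for illustration, of how PCA can shrink a 1000-feature matrix while retaining most of the variance:

```python
# Reducing a wide feature matrix with PCA (random data for illustration).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.random.rand(500, 1000)                  # 500 rows, 1000 features

X_scaled = StandardScaler().fit_transform(X)   # PCA is sensitive to scale
pca = PCA(n_components=0.95)                   # keep 95% of the variance
X_reduced = pca.fit_transform(X_scaled)
print(X_reduced.shape)                         # fewer columns than the original 1000
```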

#7 What is the difference between R2 and adjusted R2?

R2 is used to measure the performance of a linear regression model, which predicts continuous variables such as house prices. R2 expresses the degree to which the input variables explain the variation of the output (predicted) variable. For instance, if R2 is 0.8, 80% of the variation in the output variable is explained by the input variables.

When you add variables, R2 never decreases; it stays the same or increases, regardless of whether the new variables add statistical significance to your model. So, even if additional features are unimportant, R2 may go up. To handle this problem, we use adjusted R2, which penalises variables that don’t add statistical significance to the model. Adjusted R2 is therefore a better and more sophisticated metric to use when you have many variables.
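Adjusted R2 is computed as 1 - (1 - R2)(n - 1)/(n - p - 1), where n is the number of observations and p the number of predictors. A tiny sketch with made-up numbers:

```python
# Adjusted R2 penalises extra predictors that add little explanatory power.
def adjusted_r2(r2, n, p):
    """n = number of observations, p = number of predictors."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Same R2 of 0.80, but more predictors means a bigger penalty (made-up values)
print(adjusted_r2(r2=0.80, n=100, p=5))    # ~0.789
print(adjusted_r2(r2=0.80, n=100, p=50))   # ~0.596
```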

#8 In what scenario would you use regularisation techniques?

The purpose of a model is to generalise well to future/unseen data. However, it is very common to see models perform accurately on training data but not on unseen data; this is often a result of overfitting. Regularisation is a technique that improves model performance on unseen data by penalising large coefficients.
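A minimal sketch comparing plain linear regression with ridge (L2) regularisation on synthetic data, where most features are pure noise; the data and the alpha value are made up for illustration:

```python
# Ridge (L2) regularisation shrinks coefficients to reduce overfitting.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 30))                        # few rows, many features
y = 3 * X[:, 0] + rng.normal(scale=0.5, size=60)     # only one feature matters

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
for model in [LinearRegression(), Ridge(alpha=10.0)]:
    model.fit(X_train, y_train)
    print(type(model).__name__,
          "train R2:", round(model.score(X_train, y_train), 3),
          "test R2:", round(model.score(X_test, y_test), 3))
```

The gap between train and test R2 is the symptom of overfitting that regularisation is meant to narrow.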

#9 If the Pearson correlation coefficient between 2 variables is 0, can we conclude that there is no relationship between two variables?

No, we can’t conclude that. A Pearson coefficient of zero only rules out a linear relationship; there may still be a non-linear relationship between the variables.
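A quick numeric illustration: y is completely determined by x, yet the Pearson coefficient is approximately zero because the relationship is not linear.

```python
# A perfect non-linear relationship can still have ~zero Pearson correlation.
import numpy as np

x = np.linspace(-5, 5, 101)
y = x ** 2                       # y depends entirely on x, but not linearly
print(np.corrcoef(x, y)[0, 1])   # approximately 0
```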

#10 What is the goal of A/B testing? How can you implement it in data science?

Data science is an iterative process. You build a model, experiment, collect feedback, and iterate. A/B testing is a popular way to do this. This can be done by testing multiple models on the same population, or the same model on multiple parts of the population. This is best done in production.
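As a sketch, assuming we simply compare success (e.g. conversion) counts between two variants served in production, a chi-square test of independence can indicate whether the observed difference is likely to be real; the counts below are made up.

```python
# Comparing two variants' success rates with a chi-square test (made-up counts).
from scipy.stats import chi2_contingency

#           successes, failures
results = [[120, 880],           # variant A
           [150, 850]]           # variant B

chi2, p_value, dof, expected = chi2_contingency(results)
print(p_value)   # a small p-value suggests the difference is not just chance
```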

#11 What is the purpose of feature scaling in distance-based algorithms?

Let’s say you’re working with data about age and income. The scales on which these two variables operate are completely different: age might range between 1 and 100, while income might range between 10,000 and 1 crore. Distance-based algorithms may mistakenly give higher weightage to variables with larger values, causing bias.

To prevent this, we use feature scaling, which is a way to standardise variables. Commonly used techniques include z-score standardisation, min-max scaling, etc.
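A minimal sketch of both techniques with scikit-learn; the age and income values are made up.

```python
# Putting age and income on comparable scales (made-up values).
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({"age": [22, 35, 58, 41],
                   "income": [25_000, 600_000, 4_500_000, 900_000]})

print(StandardScaler().fit_transform(df))   # z-score standardisation
print(MinMaxScaler().fit_transform(df))     # rescales each column to [0, 1]
```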

#12 Can you explain a machine learning algorithm with a real-life analogy?

The easiest option is to explain the decision tree algorithm. You can pick any real-life decision and show how a decision tree models it, then extend the idea to random forest.

#13 Why do we convert categorical features into numerical features? What techniques can be used to do this?

Machines don’t understand categorical variables; algorithms work with numbers. To make the variables and their values in the dataset usable, we need to convert categorical features into numerical ones. We can use label encoding, one-hot encoding, etc. to do this.
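A small sketch of both encodings, using a made-up city column:

```python
# Label encoding vs. one-hot encoding for a categorical column (made-up data).
import pandas as pd
from sklearn.preprocessing import LabelEncoder

df = pd.DataFrame({"city": ["Delhi", "Mumbai", "Delhi", "Bengaluru"]})

df["city_label"] = LabelEncoder().fit_transform(df["city"])   # integer codes
one_hot = pd.get_dummies(df["city"], prefix="city")           # one column per city

print(df)
print(one_hot)
```

One-hot encoding avoids implying an order between categories, which label encoding can accidentally introduce for linear models.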

#14 How will you learn any new concept/algorithm given data science is changing so rapidly?

News sites, blogs and community forums help you stay up to date. But an important skill for a data scientist is the ability to read, understand and apply research papers; this is what the interviewer is trying to gauge.

#15 What do you understand by structured thinking for solving data science problems? Give examples.

Structured thinking is another key skill for a data analyst or data scientist. Problems in the real world are almost always unstructured. To solve them, you need to build your own structures. The flow mentioned in the first question is an example of a structured thinking process.

These are a few questions you might be asked in interviews for data scientist jobs. Data science is a vast field with innumerable specialisations. To learn the fundamentals of data science and kick-start your career, consider Springboard’s online data science career track program. The program lets you connect with mentors like Arihant and get career advice from professional coaches, while also offering a job guarantee!