“Raw data is garbage, and data scientists make sense of raw data,” says Chirasmita Mallick, Senior Data Scientist at G2. So it is natural that, as the ability to capture, store and process raw data increases, data scientist jobs will grow too. Many studies, from IBM to Nasscom, confirm this trend. Yet, in a high-technology field like data science, finding a job isn’t easy. The expectations of qualification, experience and skills are high, and data science interview questions are tough. To help aspiring data scientists kick-start their career, Chirasmita, who is also a mentor at Springboard, lists her top 13 data science interview questions and offers advice on how to answer them.

Data Science Interview Questions & the Best Way to Answer Them

Before we explore the questions themselves, she offers one general word of advice on how to prepare for an interview: Don’t force yourself to practice strictly theoretical answers. Often, it’s better to explore the problem in a fundamental way, explaining why you’ve chosen an algorithm or how you’d overcome challenges in the data. Interviewers for data scientist jobs might be interested in your thinking/approach, rather than what you’ve learned from Google. With that in mind, let’s begin.

1. How do you handle class imbalance problems? Give an example.

Let’s say you are trying to identify whether an email is spam or not. If class 1 (not spam) has far more examples than class 2 (spam), the model tends to ignore the minority class and it becomes difficult to build good predictions. This kind of problem is more common than you think, and a simple and effective solution is to under-sample the majority class or over-sample the minority class to balance the two.
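
For example, here is a minimal over-sampling sketch using scikit-learn’s resample utility; the spam_df DataFrame and its label column are made-up names purely for illustration.

```python
import pandas as pd
from sklearn.utils import resample

# Toy dataset: label 1 = spam (minority class), 0 = not spam (majority class)
spam_df = pd.DataFrame({
    "length": [120, 80, 300, 45, 60, 500, 90, 75],
    "label":  [0,   0,  0,   0,  0,  1,   1,  0],
})

majority = spam_df[spam_df["label"] == 0]
minority = spam_df[spam_df["label"] == 1]

# Over-sample the minority class (with replacement) up to the majority size
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=42
)
balanced = pd.concat([majority, minority_upsampled])
print(balanced["label"].value_counts())  # classes are now balanced
```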

2. What is ensemble learning? What kinds of ensemble learning are there?

Ensemble learning combines the predictions of multiple models to generate better results than any single model on its own. If you’re unable to get good results from a single approach like a decision tree or logistic regression, try an ensemble approach (a short sketch follows the list below).

Some kinds of ensemble learning are:

  • Weighted average
  • Stacking
  • Boosting
  • Bagging
  • Random forest
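
As mentioned above, here is a minimal soft-voting ensemble sketch that averages the predicted probabilities of two simple models; the synthetic dataset and the choice of models are illustrative assumptions, not a prescribed recipe.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic classification data, just for demonstration
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="soft",  # average the predicted class probabilities
)
print(cross_val_score(ensemble, X, y, cv=5).mean())
```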

3. What do you understand by normal distribution?

This one is basic statistics. A normal distribution is data that forms a bell curve, where mean = median = mode and the values are spread symmetrically around the mean, with most observations clustered close to it.
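
As a quick, purely illustrative sanity check, simulated normal data should show mean ≈ median and near-zero skew:

```python
import numpy as np
from scipy.stats import skew

# Draw 10,000 samples from a normal distribution (mean 50, std 5)
samples = np.random.default_rng(0).normal(loc=50, scale=5, size=10_000)

# Mean and median should be nearly equal; skewness should be close to 0
print(round(samples.mean(), 2), round(np.median(samples), 2), round(skew(samples), 3))
```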

4. How is the number of clusters defined in clustering?

Clustering means grouping similar elements together based on certain criteria. To decide the number of clusters, you can apply the elbow method, the gap statistic or the silhouette method.
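
Here is a minimal sketch of the elbow method with k-means, assuming a synthetic dataset of blobs: print (or plot) the inertia for increasing k and look for the point where it stops dropping sharply.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 natural clusters, for illustration only
X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # The "elbow" is where inertia stops decreasing sharply (around k = 4 here)
    print(k, round(km.inertia_, 1))
```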

5. What is p-value in statistics?

As a data scientist, you start from a null hypothesis and need to back your claim with a statistical test. The p-value, or probability value, tells you how likely it is to see results at least as extreme as yours if the null hypothesis were true; a small p-value is evidence against the null hypothesis. You might not use the p-value as part of your algorithmic processing, but it’s important for stress-testing your claims.
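
For instance, here is a minimal two-sample t-test sketch with SciPy; the group values below are made up purely for illustration.

```python
from scipy.stats import ttest_ind

# Hypothetical measurements from a control and a treatment group
control   = [12.1, 11.8, 12.4, 12.0, 11.9]
treatment = [12.9, 13.1, 12.7, 13.0, 12.8]

stat, p_value = ttest_ind(control, treatment)
# A small p-value (e.g. < 0.05) is evidence against the null hypothesis
# that the two groups share the same mean.
print(p_value)
```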

6. What are precision and recall?

Data scientists use different metrics to validate their models. Precision is the proportion of positive identifications that are actually correct. Recall is the proportion of actual positives that were identified correctly.
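
A toy example with scikit-learn’s metrics (the labels and predictions are made up):

```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # TP / (TP + FP)
print("recall:   ", recall_score(y_true, y_pred))     # TP / (TP + FN)
```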

7. What is the bias-variance trade-off?

Bias in human thinking can seep into data science models, but here bias and variance are two sources of prediction error: bias comes from overly simple assumptions, variance from being too sensitive to the training data. The goal of a data scientist is to reduce both. The bias-variance trade-off is the balance point you choose for a model so that the combined error is minimised, because reducing one typically increases the other.
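
For reference, the textbook squared-error decomposition makes this trade-off explicit; here \(\hat{f}(x)\) is your fitted model’s prediction and \(\sigma^2\) the irreducible noise:

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\mathrm{Bias}\big[\hat{f}(x)\big]^2}_{\text{bias}}
  + \underbrace{\mathrm{Var}\big[\hat{f}(x)\big]}_{\text{variance}}
  + \sigma^2
```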

8. How important is data cleaning in machine learning? List the common steps.

More than algorithms or predictions, a big part of a data scientist’s job goes into data cleaning. Common steps are finding answers to questions like the ones below (a short pandas sketch follows the list):

  • What kind of data sources do we have?
  • How do we handle missing data?
  • How do we use imputing?
  • How do we deal with categorical data?
  • What about data from different sources?
  • How do we normalise the data?
  • How do we fill missing values?
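
As promised above, here is a minimal pandas cleaning sketch covering imputation, categorical encoding and normalisation; the column names and values are hypothetical.

```python
import pandas as pd

df = pd.DataFrame({
    "age":    [25, None, 40, 35],
    "city":   ["Pune", "Delhi", None, "Pune"],
    "income": [30_000, 52_000, 61_000, 48_000],
})

df["age"] = df["age"].fillna(df["age"].median())      # impute missing numbers
df["city"] = df["city"].fillna(df["city"].mode()[0])  # fill categorical gaps
df = pd.get_dummies(df, columns=["city"])             # one-hot encode categories
df["income"] = (df["income"] - df["income"].min()) / (
    df["income"].max() - df["income"].min()
)                                                     # min-max normalise
print(df)
```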

9. What will you do if your data has missing values?

You can do nothing about the missing values, but that comes with risks: it might skew your results. Or, you can impute missing values with the mean or median. You can also use the most frequent value.
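
A small sketch of those imputation strategies using scikit-learn’s SimpleImputer; the toy matrix is illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix with missing entries (np.nan)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

for strategy in ("mean", "median", "most_frequent"):
    imputer = SimpleImputer(strategy=strategy)
    print(strategy, imputer.fit_transform(X).tolist())
```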

10. What is the basic workflow of a data scientist?

  • Finding the why of every problem. 
  • Prototyping results as fast as possible. 
  • Stress-testing results constantly.

11. What do you mean by over-fitting and under-fitting?

If your model has learned too much, picking up some of the noise along with the signal, it is over-fitting. If your model hasn’t learned enough and misses the underlying structure of the data, it is under-fitting. You handle both by managing bias and variance.
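
A minimal sketch of both situations: fit polynomial models of increasing degree and compare train and test scores (the data generation below is an illustrative assumption).

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine-shaped data, just for demonstration
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    # Low scores on both sets suggest under-fitting (degree 1);
    # a large train/test gap suggests over-fitting (degree 15).
    print(degree,
          round(model.score(X_train, y_train), 2),
          round(model.score(X_test, y_test), 2))
```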

12. How do you validate your results in a machine learning setting?

Two common metrics are the area under the ROC curve (AUC) and the F1 score.
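
A toy example of both metrics with scikit-learn (the labels and scores are made up):

```python
from sklearn.metrics import f1_score, roc_auc_score

y_true  = [0, 0, 1, 1, 1, 0, 1, 0]
y_score = [0.1, 0.4, 0.8, 0.7, 0.9, 0.3, 0.6, 0.2]  # predicted probabilities
y_pred  = [1 if s >= 0.5 else 0 for s in y_score]    # thresholded labels

print("AUC:", roc_auc_score(y_true, y_score))
print("F1: ", f1_score(y_true, y_pred))
```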

13. How do you stay relevant in the data science field?

This has become especially important during the COVID-19 pandemic. While you’re in lockdown, waiting for the clouds to clear, here are some things you can do to stay relevant:

  • Keep track of new research and read it.
  • Pursue your pet projects, but choose them wisely.
  • Build minimum viable products. 
  • Create your own website or a GitHub page where you can demo your projects.
  • Network. Understand from your peers how they’re solving problems.
  • Engage on Kaggle, collaborate on competitions and test your skills.
  • Read about data science and related fields. Springboard’s own blog page is a good place to start.

This is only a quick summary of Chirasmita’s job interview tips. To hear her answer these questions in detail, with examples, explanations and career advice, watch her live session. If you’re considering learning data science, check out Springboard’s 1:1 mentoring-led, project-driven online learning program, which also comes with a job guarantee.