
SeekACE Academy . 1st Jul, 2021


According to data from Glassdoor, data scientist roles are among the highest-paying jobs in the U.S., and LinkedIn ranks data scientist among the top 10 jobs in the United States. Data science profiles have grown over 400 times in the past year, as quoted by ‘The Economic Times’. With thousands of start-ups emerging in India over the last three years, the demand for data scientists has shot up in India as well. Data science jobs are therefore among the hottest jobs today, and data scientists are the rock stars of this era. Data science isn’t an easy field to get into; this is something all data scientists will agree on. Apart from having a degree in mathematics, statistics or engineering, data scientists need to be proficient in programming, statistics and machine learning techniques.

So, if you would like to start your career as a data scientist, you must be wondering what kind of questions you might face in a data science interview. Here’s a list of the most popular data science interview questions you can expect. With the help of these questions, you can aim to impress potential employers by knowing your subject and being able to show them the practical applications of data science.


1) What is Data Science?

Data science is an interdisciplinary field that uses scientific methods to collect, analyse and interpret large amounts of (often unstructured) data using algorithms and techniques from data mining, statistics, machine learning, analytics and programming. It plays a vital role in helping companies reduce costs, detect patterns and trends, predict consumer behaviour, find new markets and make better decisions.

2) Explain what is meant by supervised and unsupervised learning. Also mention their types.

• Supervised machine learning: The model learns from historical data to understand behaviour and predict outcomes for unseen data. Just like learning in the presence of a teacher, the algorithm learns from labelled data, that is, data already tagged with the correct answer, which makes it suitable for many real-world problems. For example, to predict how long it takes to get home from the office, the algorithm trains on past trips whose features (weather conditions, time of day, holidays, etc.) are labelled with the actual travel time. Regression and classification are the two types of supervised machine learning.

• Unsupervised machine learning: This type of ML algorithm works on unlabelled data and does not require us to supervise the model. It focuses on discovering hidden structure in unlabelled data and can handle more complex processing tasks than supervised learning; however, its results can be more unpredictable than those of other learning techniques. Clustering and association are the two types of unsupervised learning.

3) Which language is more suitable for text analysis, Python or R?

Programmers often face the dilemma of whether to use Python or R. Python is a general-purpose language preferred by developers for data analysis and for applying statistical tools, which generally makes it the more common choice for text analysis, whereas R is often chosen by engineers, statisticians and scientists whose background is in statistics rather than software development.

4) What are the data types used in Python?

Python has the following built-in data types:

  • Python Numbers- Integers, floating-point numbers and complex numbers.
  • Python List- An ordered, mutable sequence of items.
  • Python String- An immutable sequence of Unicode characters.
  • Python Tuple- An ordered sequence like a list, but it cannot be modified after creation.
  • Python Set- An unordered collection of unique items.
  • Python Dictionary- A collection of key-value pairs, useful for retrieving values quickly by key from large amounts of data. Dictionaries preserve insertion order as of Python 3.7.
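
The built-in types listed above can be tried out in a few lines. This is only a small sketch; all variable names and values are made up for illustration.

```python
# Illustrative values only; the variable names are invented for this sketch.
count = 42                   # int
price = 19.99                # float
impedance = 3 + 4j           # complex

scores = [88, 92, 75]        # list: ordered and mutable
scores.append(60)            # lists can grow in place

name = "data science"        # str: immutable sequence of Unicode characters

point = (2, 3)               # tuple: ordered but immutable
try:
    point[0] = 10            # tuples reject item assignment
except TypeError:
    tuple_is_immutable = True

tags = {"ml", "stats", "ml"}  # set: the duplicate "ml" is dropped

salaries = {"analyst": 60000, "scientist": 95000}  # dict: key-value pairs
scientist_salary = salaries["scientist"]           # lookup by key

print(type(count).__name__, type(price).__name__, type(impedance).__name__)
print(scores, point, sorted(tags), scientist_salary)
```

The `try`/`except` block is there to show that attempting to modify a tuple raises a `TypeError`, which is the practical difference between a tuple and a list.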

5) What are the types of biases that can occur during sampling?

Common types of bias, and related sources of distortion, that occur in research include:

  • Selection bias
  • Confirmation bias
  • Outliers
  • Overfitting and underfitting
  • Confounding variables

6) What is survivorship bias?

Survivorship bias is a form of selection bias: the logical error of focusing on the people, businesses, strategies or other things that succeeded while overlooking those that did not, because the failures are less visible. This can distort the facts and lead to wrong conclusions.

7) What is selection bias?

Selection bias, or sampling bias, is an experimental error that occurs when the sample data gathered and prepared for modelling is not representative of the actual, future population of cases the model will see. That is, selection bias occurs when a subset of the data is systematically (i.e., non-randomly) excluded from analysis. It is also called the selection effect.

There are various types of selection bias, such as:

  • Sampling bias: a bias caused by a systematic error, when a non-random sample of a population is chosen so that some members of the population are less likely to be included than others.
  • Time interval bias: a trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is most likely to be reached by the variable with the largest variance, even if all variables have a similar mean.
  • Data: specific subsets of data are chosen to support a conclusion, or data is rejected as bad on arbitrary grounds.
  • Attrition: attrition bias is a kind of selection bias caused by attrition, that is, by tests that did not run to completion because participants were lost.

8) Explain how a ROC curve works.

The ROC (receiver operating characteristic) curve is a graphical plot of the true positive rate against the false positive rate at various classification thresholds. The area under it, the AUC, represents the degree of separability: it tells how well the model is capable of distinguishing between classes. ROC analysis is used in many areas such as medicine, radiology, natural hazards and machine learning.
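
To make the threshold sweep concrete, here is a minimal sketch that builds the ROC points by hand and integrates them with the trapezoidal rule. The labels and scores are invented toy data, and the helper names `roc_points` and `auc` are hypothetical, not taken from any library.

```python
def roc_points(labels, scores):
    """Return (false positive rate, true positive rate) pairs, one per threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]  # threshold above every score: nothing predicted positive
    # Lower the threshold step by step, from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under the curve, computed with the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


# Toy data: 1 = positive class, scores are predicted probabilities.
labels = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]

curve = roc_points(labels, scores)
print(curve)
print(round(auc(curve), 3))  # 0.889 for this toy data
```

An AUC of 1.0 would mean every positive is scored above every negative, while 0.5 corresponds to random ranking. Here one positive (0.6) is ranked below one negative (0.7), so the area comes out at 8/9.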

9) Explain star schema.

It is a standard database schema with a central fact table. Satellite tables map IDs to physical names or descriptions and are connected to the central fact table through the ID fields; these tables are referred to as lookup (dimension) tables and are principally useful in real-time applications, as they save a lot of memory. For business process data, the fact table holds the quantitative measures of the business, such as sales price, sale quantity, distance, weight and speed, while the dimension tables describe the characteristics associated with those facts.
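
As a rough illustration of the layout described above, the sketch below builds a tiny star schema in an in-memory SQLite database using Python's standard `sqlite3` module. The table names, column names and rows are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension (lookup) tables map IDs to descriptive names.
    CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
    CREATE TABLE dim_store   (store_id   INTEGER PRIMARY KEY, city TEXT);

    -- The central fact table holds the measures plus the ID fields.
    CREATE TABLE fact_sales (
        product_id INTEGER REFERENCES dim_product(product_id),
        store_id   INTEGER REFERENCES dim_store(store_id),
        quantity   INTEGER,
        sale_price REAL
    );

    INSERT INTO dim_product VALUES (1, 'Laptop'), (2, 'Phone');
    INSERT INTO dim_store   VALUES (10, 'Delhi'), (20, 'Mumbai');
    INSERT INTO fact_sales  VALUES (1, 10, 2, 900.0), (2, 10, 5, 300.0), (1, 20, 1, 950.0);
""")

# Join the fact table to one dimension to report revenue by product name.
revenue_by_product = conn.execute("""
    SELECT p.product_name, SUM(f.quantity * f.sale_price) AS revenue
    FROM fact_sales AS f
    JOIN dim_product AS p ON p.product_id = f.product_id
    GROUP BY p.product_name
    ORDER BY p.product_name
""").fetchall()

print(revenue_by_product)  # [('Laptop', 2750.0), ('Phone', 1500.0)]
conn.close()
```

The fact table stores only compact IDs and numbers; the descriptive text lives once in each dimension table, which is where the memory saving comes from.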

10) Which technique is used to predict categorical responses?

Classification algorithms are the ML techniques used for predicting categorical outcomes.

11) What is logistic regression? Or state an example where you have used logistic regression recently.

Logistic regression, often referred to as the logit model, is a regression technique for predicting a binary outcome from a linear combination of predictor variables. For example, suppose you want to predict whether a particular politician will win an election. The outcome is binary, i.e. 1 or 0 (win/lose). The predictor variables here would be the amount of money spent on the candidate’s election campaign, the amount of time spent campaigning, etc.
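
The election example can be sketched in plain Python. The data below is invented, and fitting by batch gradient descent is just one simple way to estimate the coefficients, chosen here as an assumption to keep the example free of external libraries. The function names `fit_logistic` and `predict_proba` are hypothetical.

```python
import math


def sigmoid(z):
    """Map a linear score to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, weights, bias):
    """Probability of the positive class for one feature vector."""
    return sigmoid(bias + sum(w * xi for w, xi in zip(weights, x)))


def fit_logistic(rows, labels, learning_rate=0.1, epochs=2000):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(rows[0])
    bias = 0.0
    n = len(rows)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(rows, labels):
            error = predict_proba(x, weights, bias) - y
            for j, xi in enumerate(x):
                grad_w[j] += error * xi
            grad_b += error
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


# Made-up data: (campaign spend in millions, months spent campaigning) -> won (1) / lost (0)
candidates = [(1.0, 2.0), (2.0, 1.0), (2.5, 3.0), (4.0, 5.0), (5.0, 4.0), (6.0, 6.0)]
won = [0, 0, 0, 1, 1, 1]

weights, bias = fit_logistic(candidates, won)
strong_campaign = predict_proba((5.5, 5.0), weights, bias)
weak_campaign = predict_proba((1.5, 1.0), weights, bias)
print(round(strong_campaign, 2), round(weak_campaign, 2))
```

The model outputs a probability; turning it into a win/lose prediction requires choosing a threshold, commonly 0.5.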

12) Is it possible to perform logistic regression with Microsoft Excel?

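Yes, Microsoft Excel is a very powerful tool and it is possible to perform logistic regression with it, for example by setting up the log-likelihood in worksheet cells and maximising it with the Solver add-in.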

13) How can you assess a good logistic model?


There are several metrics for assessing the performance of a logistic regression model:

  • Classification accuracy is the share of predictions that are correct. It works well when there is an equal number of samples in each class; on imbalanced classes it can give a false sense of high accuracy.
  • Confusion matrix: a table of predicted versus actual classes that shows the ways in which the classification model is confused when it makes predictions. It can easily be applied to problems with three or more classes as well.
  • AUC is the area under the ROC curve and is important because it measures the model’s capability to distinguish between classes. The higher the AUC, the better the model is at distinguishing, say, between patients with the disease and patients without it. It always lies in the range [0, 1].
  • F1 score is the harmonic mean of precision and recall and is a good measure of a test’s accuracy. A greater F1 score indicates better performance of the model.
  • Logarithmic loss is also suitable for multi-class classification and works by penalising false classifications, more heavily the more confident they are, so minimising log loss generally gives a more accurate model.
  • Mean Absolute Error (MAE) is the average of the absolute differences between the original values and the predicted values. It tells us how far the predictions were from the actual output, but it gives no idea of the direction of the error, that is, whether we are under-predicting or over-predicting.
  • Mean Squared Error (MSE) is the average of the squared differences between the original values and the predicted values. Its gradient is easy to compute.
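
Several of the metrics above can be computed by hand on a small example. The sketch below uses invented labels and predicted probabilities with a 0.5 decision threshold; the helper names `confusion_counts` and `log_loss` are hypothetical and written just for this illustration.

```python
import math


def confusion_counts(actual, predicted):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn


def log_loss(actual, probabilities, eps=1e-15):
    """Average negative log-likelihood; probabilities are clipped to avoid log(0)."""
    total = 0.0
    for y, p in zip(actual, probabilities):
        p = min(max(p, eps), 1 - eps)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(actual)


# Toy data: true labels and the model's predicted probabilities.
actual = [1, 0, 1, 1, 0, 0, 1, 0]
probabilities = [0.9, 0.2, 0.4, 0.8, 0.6, 0.1, 0.7, 0.3]
predicted = [1 if p >= 0.5 else 0 for p in probabilities]

tp, fp, fn, tn = confusion_counts(actual, predicted)
accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)
mae = sum(abs(y - p) for y, p in zip(actual, probabilities)) / len(actual)
mse = sum((y - p) ** 2 for y, p in zip(actual, probabilities)) / len(actual)

print("confusion (TP, FP, FN, TN):", (tp, fp, fn, tn))  # (3, 1, 1, 3)
print("accuracy:", accuracy, "F1:", f1)                 # 0.75 and 0.75
print("log loss:", round(log_loss(actual, probabilities), 3))
print("MAE:", round(mae, 3), "MSE:", round(mse, 3))
```

Here MAE and MSE are applied to the predicted probabilities; they are more commonly reported for models that predict continuous values.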

14) What are recommender systems?

A recommender system (or recommendation engine) is a subclass of information filtering systems that analyses users’ history and behaviour to predict the preference or rating a user would give to a product. Recommender systems are widely used for movies, news, social tags, music, etc.

15) What is collaborative filtering?

Collaborative filtering is the technique used by most recommender systems to find patterns or information by combining the viewpoints of many users, various data sources and multiple agents. It is generally treated as a form of unsupervised learning. Many websites, such as Amazon, Netflix and YouTube, use collaborative filtering for their recommendations.
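
One simple variant of this idea is user-based collaborative filtering: predict a user's rating for an item as a similarity-weighted average of other users' ratings. The sketch below assumes cosine similarity over co-rated items, which is only one of several possible choices; the users, movies and ratings are made up.

```python
import math

# Hypothetical ratings on a 1-5 scale; "asha" has not rated "Up" yet.
ratings = {
    "asha":  {"Inception": 5, "Titanic": 1, "Avatar": 4},
    "bilal": {"Inception": 4, "Titanic": 2, "Avatar": 5, "Up": 5},
    "chen":  {"Inception": 1, "Titanic": 5, "Up": 2},
}


def cosine_similarity(a, b):
    """Cosine similarity computed over the items both users have rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(a[i] ** 2 for i in shared))
    norm_b = math.sqrt(sum(b[i] ** 2 for i in shared))
    return dot / (norm_a * norm_b)


def predict_rating(user, item):
    """Similarity-weighted average of other users' ratings for the item."""
    numerator = denominator = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        numerator += similarity * other_ratings[item]
        denominator += similarity
    return numerator / denominator if denominator else None


predicted = predict_rating("asha", "Up")
print(round(predicted, 2))
```

Because asha's tastes are closer to bilal's than to chen's, the predicted rating lands nearer to bilal's 5 than to chen's 2.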

16) What are the various steps involved in an analytics project?

  • Understand the business problem, that is, clearly define the objectives the company wants to achieve with the project.
  • Obtain and extract the necessary data and become familiar with it.
  • Prepare the data for modelling by detecting outliers, treating missing values, transforming variables, etc., in order to avoid distorted results.
  • Model and analyse the prepared data using regression formulas and other algorithms to predict future results. This step is repeated until the best possible outcome is achieved.
  • Interpret the results and gather insights to find generalisations and patterns that can be applied to future data.
  • Finally, implement the model and track the results to evaluate its performance over time.

17) Why does data cleaning play a vital role in analysis?

Cleaning data from multiple sources to transform it into a format that data analysts can work with is a cumbersome process, but it is crucial: data usually arrives in raw form, and most ML algorithms cannot work directly on raw data and need clean, well-labelled data. As the number of data sources increases, the time taken to clean the data rises sharply, and cleaning alone might take up to 80% of the total time, making it a critical part of the analysis task.
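
A few typical cleaning steps can be sketched with the standard library alone. The records below are invented, and the choices made here (dropping exact duplicates, normalising text, filling missing numbers with the column median) are assumptions for illustration; the right treatment always depends on the data.

```python
from statistics import median

# Invented raw records with stray whitespace, mixed case, gaps and a duplicate.
raw_rows = [
    {"city": " Delhi ", "age": "34", "income": "52000"},
    {"city": "delhi",   "age": "",   "income": "48000"},
    {"city": "Mumbai",  "age": "29", "income": "n/a"},
    {"city": "Mumbai",  "age": "29", "income": "n/a"},   # exact duplicate
    {"city": "MUMBAI",  "age": "41", "income": "61000"},
]


def to_number(text):
    """Convert text to float, returning None when the value is missing or invalid."""
    try:
        return float(text)
    except ValueError:
        return None


# 1. Drop exact duplicates while preserving the original order.
seen, unique_rows = set(), []
for row in raw_rows:
    key = tuple(sorted(row.items()))
    if key not in seen:
        seen.add(key)
        unique_rows.append(row)

# 2. Normalise text fields and convert numeric fields.
cleaned = [
    {
        "city": row["city"].strip().title(),
        "age": to_number(row["age"]),
        "income": to_number(row["income"]),
    }
    for row in unique_rows
]

# 3. Fill missing numeric values with the column median (one possible strategy).
for column in ("age", "income"):
    known = [row[column] for row in cleaned if row[column] is not None]
    fill_value = median(known)
    for row in cleaned:
        if row[column] is None:
            row[column] = fill_value

for row in cleaned:
    print(row)
```

After these steps, four rows remain, every city is spelled consistently, and no numeric field is empty.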

18) What do you understand by linear regression?

Linear regression is one of the most well-known and well-understood algorithms in ML; it helps in understanding the linear relationship between the dependent and the independent variables.

It is a supervised learning algorithm and belongs to both statistics and ML. It involves two kinds of variable: the predictor, or independent variable, and the response, or dependent variable. In linear regression, we try to understand how the dependent variable changes with respect to the independent variable.

There are two types of linear regression: simple linear regression, which uses a single independent variable, and multiple linear regression, which uses two or more.
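
For simple linear regression, the least-squares slope and intercept have a closed form: the slope is the covariance of x and y divided by the variance of x. The sketch below applies it to made-up data; the function name `fit_simple_linear_regression` is hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data that lies exactly on the line salary = 30 + 5 * years.
years_experience = [1, 2, 3, 4, 5]
salary_thousands = [35, 40, 45, 50, 55]

slope, intercept = fit_simple_linear_regression(years_experience, salary_thousands)
print(slope, intercept)          # 5.0 30.0
print(intercept + slope * 6)     # predicted salary for 6 years: 60.0
```

Here `years_experience` plays the role of the independent variable and `salary_thousands` the dependent one. Real data will not sit exactly on a line, so the fitted line minimises the sum of squared residuals instead.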

19) What are the assumptions required for linear regression?

There are four major assumptions:

  1. There is a linear and additive relationship between the dependent variable (response) and the regressors, meaning that a straight-line model is an appropriate fit for the data.
  2. The errors, or residuals, must be normally distributed and independent of each other, that is, they should not be correlated.
  3. There is minimal multicollinearity between the explanatory variables.
  4. Homoscedasticity, which means “having the same scatter”: the variance of the residuals around the regression line is the same for all values of the predictor variable.

20) What is the difference between regression and classification ML techniques?

Regression and classification are both machine learning techniques and come under supervised machine learning. Classification involves finding a model or function that separates the data into multiple categorical classes, i.e. discrete values, whereas regression involves finding a model or function that maps the data to continuous real values instead of classes or discrete values. The predicted values in classification are unordered, while the predicted values in regression are ordered. Examples of classification algorithms are decision trees and logistic regression, whereas random forest and linear regression are examples of regression algorithms (tree-based methods such as decision trees and random forests can in fact be used for both tasks).
