SeekACE Academy . 1st Jul, 2021


41) What are the feature selection methods used to select the proper variables?

Feature selection is an important concept of ML and involves removal of irrelevant or partially relevant features which may negatively impact the performance of our model.

There are three feature selection methods –

1.Filter Methods- This method of feature selection is independent of any ML algorithm and is usually used as a pre-processing step. Features are chosen on the basis of their scores in statistical tests.

These tests include:

• Pearson’s correlation

• Linear discriminant analysis

• Chi-Square

Filter methods don’t remove multicollinearity, so this must be dealt with before training our models.

2.Wrapper Methods

In this method we train a model on a subset of features and, based on the inferences we draw from that model, decide to add or remove features from the subset. Wrapper methods are computationally very expensive, and high-end computers are needed if they are run on large amounts of data.

This involves:

• Forward Selection: We start with no features and add them one at a time, keeping each one that improves the model, until we get a good fit.

• Backward Selection: We start with all the features and remove them one at a time to see what works better.

• Recursive Feature Elimination: Repeatedly fits a model, ranks the features by importance and discards the weakest, until the desired number of features remains.

3.Embedded Methods

They are a mixture of both filter and wrapper methods and combine the qualities of both. They are implemented with algorithms that have their own built-in feature selection.

They include LASSO and RIDGE regression, which have built-in penalty terms to reduce overfitting. LASSO can shrink the coefficients of unhelpful features to exactly zero, while RIDGE only shrinks them towards zero.
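
As an illustration of a filter method, the sketch below scores each feature by its Pearson correlation with the target and keeps those above a threshold. The function names, the threshold of 0.5 and the toy data are all assumptions made for this example, not a standard recipe.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(var_x * var_y)

def filter_select(features, target, threshold=0.5):
    """Return the names of features whose |correlation| with the target meets the threshold."""
    return [name for name, values in features.items()
            if abs(pearson(values, target)) >= threshold]

# Hypothetical toy data: only hours_studied is strongly related to the target.
features = {
    "hours_studied": [1, 2, 3, 4, 5],
    "shoe_size": [7, 7, 7, 7, 7],
    "noise": [3, 1, 4, 1, 5],
}
target = [52, 58, 61, 70, 74]
selected = filter_select(features, target)
print(selected)
```

Because the scoring never trains a model, this step is cheap, but it looks at each feature in isolation and so cannot detect features that are redundant with each other.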

42) How do you deal with missing data in a dataset? What percentage of missing data is acceptable?

Missing data can have a significant effect on the conclusions drawn from the data: it can reduce the representativeness of the sample, and the lost data can bias the estimation of parameters.

As for what percentage is acceptable, there is no fixed rule, but some studies suggest that a missing rate of 5% or less is inconsequential.

The following are ways to handle missing data values:

  • Case deletion- If the number of cases with missing values is small, it is better to drop them.
  • Imputation- This means replacing the missing data with an estimated value obtained by some statistical method.
  • Imputation by Mean, Mode, Median- For numerical data, missing values can be replaced by the mean of the complete cases of the variable. If the variable is suspected to have outliers, the median can be used instead. For a categorical feature, missing values can be replaced by the mode of the column.
  • Regression Methods- The variable with missing values is treated as the dependent variable and the other variables as predictors. A linear equation is fitted on the complete cases and is then used to predict values for the missing data points.
  • K-Nearest Neighbour Imputation (KNN)- The k-nearest neighbour algorithm is used to estimate and replace missing data. It chooses the K nearest neighbours using some distance measure, and their average is used as the imputation estimate.
  • Multiple Imputation- It is an iterative method in which multiple values are estimated for each missing data point using the distribution of the observed data.
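
The mean, median and mode strategies listed above can be sketched with Python's standard `statistics` module. The `impute` helper, the use of `None` to mark a missing value and the sample columns are assumptions made for this example.

```python
import statistics

def impute(values, strategy="mean"):
    """Replace None entries with the mean, median or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.mean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

ages = [22, None, 25, 90, 23]           # 90 looks like an outlier
colours = ["red", None, "blue", "red"]  # categorical column

print(impute(ages, "mean"))      # the outlier pulls the filled value up
print(impute(ages, "median"))    # the median is less affected by it
print(impute(colours, "mode"))   # most frequent category
```

Comparing the mean and median results on `ages` shows why the median is preferred when outliers are suspected.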

43) What are dimensionality reduction and its benefits?

In ML the final classification is based on many factors called features. Often many of these features are correlated and hence redundant. Dimensionality reduction refers to the process of removing this redundancy by representing a vast data set with fewer dimensions (fields) that convey similar information concisely. It is a very important part of machine learning and predictive modeling.

This reduction helps in compressing data and reducing storage space, and fewer dimensions also mean less computation time. For example, it can be used when classifying email as spam or not spam. Despite its benefits, it might lead to the loss of some information.

44) What are the feature vectors?

A vector means an array of numbers. In machine learning, these vectors are called feature vectors because each value represents a numeric or symbolic characteristic (called a feature) of an object in a mathematical form that’s easy to analyse.

45) What is sampling? How many sampling methods do you know?

Data sampling is a statistical analysis technique in which a representative subset of data points is selected, manipulated and analysed to identify patterns and trends in the larger data set being examined.

The important types of probability sampling methods are simple random sampling, stratified sampling, cluster sampling, multistage sampling, and systematic random sampling. The key benefit of these methods is that every member of the population has a known chance of being selected, which makes it more likely that the sample is representative of the population.

46) Why is resampling done?

While dealing with an imbalanced dataset, one of the possible strategies is to resample either the minority or the majority class to artificially generate a balanced training set that can be used to train a machine learning model. Resampling methods are easy to use and require little mathematical knowledge.

Resampling is also done to improve the accuracy of sample statistics by using subsets of the available data, or by drawing randomly with replacement from a set of data points. The two common resampling methods based on random subsets are bootstrapping and cross-validation.
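
Bootstrapping can be sketched in a few lines: draw many resamples with replacement, compute the statistic on each, and look at how much it varies. The function name, the number of resamples, the fixed seed and the data are assumptions made for this example.

```python
import random
import statistics

def bootstrap_means(data, n_resamples=1000, seed=0):
    """Return the mean of each bootstrap resample (drawn with replacement)."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    means = []
    for _ in range(n_resamples):
        resample = rng.choices(data, k=len(data))  # same size as the original sample
        means.append(statistics.mean(resample))
    return means

data = [2.1, 2.5, 3.0, 3.2, 3.8, 4.1, 4.4, 5.0]
means = bootstrap_means(data)

# The spread of the resampled means estimates the standard error of the sample mean.
std_error = statistics.stdev(means)
print(round(std_error, 3))
```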

47) What is root cause analysis?

Root cause analysis was initially developed to analyse industrial accidents but is now widely used in other areas, especially medicine. It is a problem-solving technique that helps people answer the question of why the problem occurred in the first place. This analysis assumes that systems and events are interrelated, which means an action in one area triggers an action in another, and another, and so on. By tracing back these actions, we can discover where the problem started and how it grew into the symptom we are now facing. A factor is called a root cause if removing it from the problem-fault sequence prevents the final undesirable event from recurring.

48) Explain cross-validation?

Cross-validation is a model validation technique in which we train our model on a subset of the data set and then evaluate it on the complementary subset. This is done by reserving some portion of the data set, using the rest of the data to train the model and finally testing the model on the reserved portion. It is very useful in settings where the objective is to forecast and we have to estimate how accurately a model will perform in practice.

Its main goal is to define a data set to test the model in the training phase (i.e. a validation data set) so as to limit problems like overfitting and determine how the model will generalize to an independent data set.
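
One common variant is k-fold cross-validation, where each of k portions is reserved in turn. The sketch below only produces the train and test index splits. The helper name is an assumption, and real code would usually shuffle the data first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of k contiguous folds."""
    indices = list(range(n_samples))
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                 # reserved portion
        train = indices[:start] + indices[start + size:]   # the rest of the data
        yield train, test
        start += size

folds = list(k_fold_indices(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Every sample is used for testing exactly once, and the scores from the k runs are typically averaged.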

49) What are the confounding variables?

These are extraneous variables in a statistical model that influence the outcome of an experimental design and correlate directly or inversely with both the dependent and the independent variable. Simply put, a confounding variable is an extra variable that was not accounted for and can spoil an experiment and produce erroneous results. Take the example of research into the effect of lack of exercise on weight gain, where lack of exercise is the independent variable and weight gain the dependent one. Certain other influences can also cause weight gain, and these extra influences are called confounding variables. Confounding variables can be controlled by randomization, matching and statistical control.

50) What is gradient descent?

Gradient descent is an optimization algorithm that is most useful when the parameters cannot be calculated analytically. It minimises a function by iteratively moving in the direction of steepest descent, as defined by the negative of the gradient, so as to reduce the loss as quickly as possible. In machine learning, gradient descent is used to update the parameters of our model.
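
A minimal sketch of the update rule, x ← x − learning_rate × gradient, on the one-dimensional function f(x) = (x − 3)², whose minimum is at x = 3. The function, learning rate and step count are assumptions chosen for this example.

```python
def gradient_descent(grad, x0, learning_rate=0.1, n_steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(n_steps):
        x = x - learning_rate * grad(x)  # move in the direction of steepest descent
    return x

# f(x) = (x - 3)^2 has gradient 2 * (x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), x0=0.0)
print(minimum)  # very close to 3
```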

51) Why is gradient descent stochastic in nature?

The term stochastic means random. In stochastic gradient descent (SGD), a randomly selected sample (or small batch of samples) is used in each iteration instead of the whole data set. SGD often converges much faster than GD, but the error function is not as well minimized as in the case of GD.
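
The sketch below fits the slope of a line y = w·x with stochastic gradient descent, updating the parameter after each randomly ordered sample rather than after a pass over the whole data set. The function name, learning rate, epoch count and noise-free toy data are assumptions made for this example.

```python
import random

def sgd_fit_slope(xs, ys, learning_rate=0.01, n_epochs=200, seed=0):
    """Fit w in y = w * x by updating on one randomly chosen sample at a time."""
    rng = random.Random(seed)
    w = 0.0
    order = list(range(len(xs)))
    for _ in range(n_epochs):
        rng.shuffle(order)  # visit the samples in a random order each epoch
        for i in order:
            error = w * xs[i] - ys[i]
            w -= learning_rate * 2 * error * xs[i]  # gradient of the squared error for this sample only
    return w

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]  # generated by y = 2x
w = sgd_fit_slope(xs, ys)
print(w)  # close to 2
```

With noisy data the estimate would keep jittering around the best value, which is the price paid for the cheaper updates.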

52) Do gradient descent methods always converge to the same point?

No, not always. Gradient descent converges towards a local minimum, and it is only guaranteed to reach the global minimum if the function has a single minimum or if it starts close enough to the global minimum. If the function has multiple local minima, the point on which gradient descent converges depends on where the iteration starts. It is very hard to guarantee convergence to a global minimum, so in order to look for it we should run the gradient descent algorithm more than once, changing the starting point or the learning rate. Simulated annealing tackles the same problem with probabilistic perturbations.
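
To see the dependence on the starting point, consider f(x) = x⁴ − 2x², which has two minima, at x = −1 and x = 1. The function and the settings in this sketch are assumptions chosen for the example.

```python
def descend(x, learning_rate=0.05, n_steps=500):
    """Gradient descent on f(x) = x**4 - 2 * x**2, whose derivative is 4x^3 - 4x."""
    for _ in range(n_steps):
        x -= learning_rate * (4 * x**3 - 4 * x)
    return x

left = descend(-0.5)   # starts left of zero, ends near -1
right = descend(0.5)   # starts right of zero, ends near +1
print(left, right)
```

The same algorithm with the same learning rate lands in a different minimum purely because of where it started.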

53) What do you understand by Hypothesis in content learning of machine learning?

A hypothesis is a provisional idea, an educated guess which requires some evaluation. Supervised learning can be described as using available data to learn a function that best maps inputs to outputs. It can also be described as the problem of approximating an unknown target function (that we assume exists) that best maps inputs to outputs on all possible observations from the problem domain.

Therefore, hypothesis in machine learning can be rightly defined as a model that approximates the target function and performs mappings of inputs to outputs.

54) What does the cost parameter in SVM stand for?

The cost parameter in SVM controls the trade-off between training errors and margin width, and decides how much an SVM should be allowed to “bend” with the data. A low cost creates a large margin and allows more misclassifications, giving a smooth decision surface. A large cost creates a narrow margin (approaching a hard margin) and permits fewer misclassifications, so more training points are classified correctly.

55) What is a skewed distribution? Differentiate between right skewed and left skewed distributions.

When a distribution creates a curve which is not symmetrical, or in other words the data points cluster more on one side of the graph than the other, it is referred to as a skewed distribution. There are two types of skewed distribution: left skewed and right skewed.

A left skewed or negatively skewed distribution has fewer observations on the left, towards lower values, which means it has a long left tail. For example, marks obtained by students in an easy exam, where most marks are high and only a few are low.

A distribution with fewer observations on the right, towards higher values, is said to be right skewed or positively skewed. Its graph has a long right tail. For example, the wealth of people in a country.
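
The sign of the skewness coefficient tells the two cases apart: positive for a long right tail, negative for a long left tail. The sketch below uses the population moment formula (the mean of the cubed standardised deviations). The function name and both data sets are made up for this example.

```python
import statistics

def skewness(data):
    """Population skewness: mean of ((x - mean) / std) ** 3."""
    mean = statistics.fmean(data)
    std = statistics.pstdev(data)
    return sum(((x - mean) / std) ** 3 for x in data) / len(data)

wealth = [1, 1, 2, 2, 3, 3, 4, 50]               # a few very large values: long right tail
exam_marks = [20, 85, 88, 90, 92, 95, 97, 99]    # most marks high, one very low: long left tail

print(skewness(wealth) > 0)       # right skewed
print(skewness(exam_marks) < 0)   # left skewed
```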

56) Why is vectorization considered a powerful method for optimizing numerical code?

Vectorization is a very powerful method for improving the performance of code running on modern CPUs. Numerous problems in the real world can be reduced to a set of simultaneous equations. Any set of linear equations can be expressed as a vector equation (this remains true for differential equations). Vectors are not restricted to the physical idea of magnitude and direction; they are widely applied to problems in fields like statistics and economics. If you want a computer program to address such problems, it is much easier if the computer handles the vector manipulations rather than having the human programmer code every detail of the calculation, which has led to the rise of vector languages.


57) What is deep learning, and how does it contrast with other machine learning algorithms?

Deep learning is a subset of machine learning that is concerned with helping machines solve complex problems, even when using a large data set that is very diverse, unstructured, semi-structured and inter-connected. Deep learning algorithms learn representations of the data themselves through neural nets with many layers, and can be trained in a supervised or an unsupervised way.

Other machine learning algorithms also learn from past data and make decisions based on what they have learnt, but they usually depend on features that are designed by hand.

58) How is conditional random field different from hidden Markov models?

Conditional random fields (CRFs) and hidden Markov models (HMMs) are widely adopted statistical modeling methods in the business world and are often applied to pattern recognition and machine learning problems.

CRFs are discriminative sequence models and, because of their flexible feature design, can accommodate any context information. The major drawback of this model is its complex computation at the training stage, which makes it difficult to re-train the model when new data is added.

HMMs are strong statistical generative models which facilitate learning directly from raw sequence data. They are capable of performing a wide variety of operations such as multiple alignment, data mining and classification, structural analysis, and pattern discovery, and can be easily combined into libraries.

59) What are eigenvalue and eigenvector?

Both eigenvalues and eigenvectors of a system are extremely important in data analysis. Eigenvectors are the directions along which a specific linear transformation acts by flipping, compressing, or stretching, and eigenvalues are the factors by which it stretches or compresses along those directions.

An eigenvector is therefore a vector whose direction remains unchanged (or is exactly reversed) when the linear transformation is applied to it. In data analysis, eigenvectors are usually calculated for a correlation or covariance matrix.

Several analyses such as principal component analysis (PCA), factor analysis and canonical analysis are based on eigenvalues and eigenvectors.
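
In symbols, a non-zero vector v is an eigenvector of a matrix A with eigenvalue λ when A·v = λ·v. The sketch below checks this for a small symmetric matrix whose eigenpairs are known by hand. The matrix and the helper name are assumptions chosen for this example.

```python
def mat_vec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * v for a, v in zip(row, vector)) for row in matrix]

A = [[2, 1],
     [1, 2]]

# Known eigenpairs of A: eigenvalue 3 with direction [1, 1], eigenvalue 1 with direction [1, -1].
for eigenvalue, eigenvector in [(3, [1, 1]), (1, [1, -1])]:
    transformed = mat_vec(A, eigenvector)
    scaled = [eigenvalue * v for v in eigenvector]
    print(transformed == scaled)  # A·v equals λ·v, so the direction is unchanged
```

In practice a numerical library would be used to find the eigenvalues and eigenvectors rather than verifying known ones.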

60) What are the different types of joins? What are the differences between them?

There are four types of joins used in different data-oriented languages-

(INNER) JOIN: Keeps only the rows where the merge “on” value exists in both the left and right dataframes, that is, rows that have matching column values in the two dataframes. This is similar to the intersection of two sets. It is one of the most common types of joins used.

LEFT (OUTER) JOIN: Keeps every row in the left dataframe. Where the “on” value is missing from the right dataframe, empty / NaN values are added to the result.

RIGHT (OUTER) JOIN: It is similar to the left outer join, except that all the rows of the right dataframe are kept and only the matching rows of the left dataframe are added.

FULL (OUTER) JOIN: Keeps all rows from both the left and the right table, adding empty / NaN values where a row has no match on the other side.
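
The four behaviours can be sketched with plain dictionaries keyed by the join column, using `None` where a dataframe library would put NaN. The `join` helper and the two small tables are hypothetical and only illustrate which keys each join keeps.

```python
def join(left, right, how="inner"):
    """Join two {key: value} tables; a missing side is filled with None."""
    if how == "inner":
        keys = [k for k in left if k in right]                     # keys present in both
    elif how == "left":
        keys = list(left)                                          # every left key
    elif how == "right":
        keys = list(right)                                         # every right key
    elif how == "outer":
        keys = list(left) + [k for k in right if k not in left]    # union of keys
    else:
        raise ValueError(f"unknown join type: {how}")
    return {k: (left.get(k), right.get(k)) for k in keys}

names = {1: "Ana", 2: "Ben", 3: "Cy"}
orders = {2: "book", 3: "pen", 4: "lamp"}

for how in ("inner", "left", "right", "outer"):
    print(how, join(names, orders, how))
```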

61) What are the core components of descriptive data analytics?

Descriptive analytics is a preliminary stage of data processing that summarizes what has already happened. Data aggregation, summarization and visualization are the main pillars supporting the area of descriptive data analytics. A typical example is tracking the increase in Twitter followers after a particular tweet.

62) How can outlier values be treated?

In both statistics and machine learning, outlier detection is crucial for building an accurate model and getting good results. However, detecting outliers can be very difficult and is not always possible. There are three methods to deal with outliers.

  1. Univariate method- One of the simplest methods for detecting outliers: we look for data points with extreme values on one variable.
  2. Multivariate method- Here we look for unusual combinations of values across all the variables.
  3. Minkowski Error- This method does not remove outliers but reduces the contribution of potential outliers in the training process.

The most common ways to treat outlier values are:

1) To change the value and bring it within a range.

2) To just remove the value.
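
Both treatments can be sketched for a single variable using the interquartile range (IQR) rule, which is one common univariate criterion. It is swapped in here purely for illustration, and the 1.5 multiplier, the helper names and the data are assumptions.

```python
import statistics

def iqr_bounds(data, k=1.5):
    """Return (low, high) limits: values outside them are treated as outliers."""
    q1, _, q3 = statistics.quantiles(data, n=4)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

def clip_outliers(data, k=1.5):
    """Treatment 1: bring extreme values back within the allowed range."""
    low, high = iqr_bounds(data, k)
    return [min(max(x, low), high) for x in data]

def drop_outliers(data, k=1.5):
    """Treatment 2: remove extreme values altogether."""
    low, high = iqr_bounds(data, k)
    return [x for x in data if low <= x <= high]

data = [10, 12, 11, 13, 12, 11, 95]
print(clip_outliers(data))
print(drop_outliers(data))
```

Clipping keeps the number of rows unchanged, which matters when other columns of the same record are still useful.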

63) How can you iterate over a list and also retrieve element indices at the same time?

In programming, we use lists to store sequences of related data. To iterate over a list and get each index at the same time we can use the enumerate function, which takes every element in a sequence such as a list and pairs it with its position, with the index coming first.
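
A short Python illustration of `enumerate` follows; the list contents are made up for the example.

```python
fruits = ["apple", "banana", "cherry"]

# enumerate yields (index, element) pairs, counting from 0 by default.
pairs = list(enumerate(fruits))

for index, fruit in enumerate(fruits):
    print(index, fruit)

# The optional start argument changes the first index.
numbered = list(enumerate(fruits, start=1))
```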

64) Explain the Box-Cox transformation in regression models.

A Box-Cox transformation is a statistical technique often used to modify the distributional shape of the response variable so that the residuals are more normally distributed. It is done so that tests and confidence limits that require normality can be more appropriately used. However, it is not suitable for data containing outliers, which may not be properly adjusted by this technique. Many statistical techniques assume normality, so if the given data is not normal, applying a Box-Cox transformation means you can run a broader range of tests.
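
The transformation itself is (y^λ − 1) / λ for λ ≠ 0 and ln(y) for λ = 0, and it is only defined for positive values. The sketch below applies it for a given λ. The function name is an assumption, and choosing the best λ (usually done by maximum likelihood) is left out.

```python
import math

def box_cox(values, lam):
    """Apply the Box-Cox transformation with parameter lam to positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]       # limiting case as lam approaches 0
    return [(v ** lam - 1) / lam for v in values]

response = [1.0, 4.0, 9.0, 16.0]
print(box_cox(response, 0.5))   # square-root-like transform
print(box_cox(response, 0))     # log transform
print(box_cox(response, 1))     # lam = 1 only shifts the data by 1
```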

65) Can you use machine learning for time series analysis?

Yes, it can be used for forecasting time series data such as inventory, sales and volume, but it depends on the application. It is an important area of machine learning because many prediction problems involve a time component.

66) What is Precision and Recall in ML?

Precision and recall are both extremely important indicators of how well a model performs. Precision is the fraction of the results returned by your algorithm that are relevant. Recall, on the other hand, is the fraction of all relevant results that your algorithm correctly identified. The two usually cannot be maximised at the same time, as one comes at the cost of the other. For example, a search results page has limited space and the customer has an extremely limited attention span, so if we are shown a lot of irrelevant results and very few relevant ones (low precision), we will shift to other sites or platforms.
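
In terms of counts, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP, FP and FN are true positives, false positives and false negatives. A minimal sketch for binary labels follows. The function name and the label vectors are made up for this example.

```python
def precision_recall(y_true, y_pred, positive=1):
    """Return (precision, recall) for the given positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0  # guard against no positive predictions
    recall = tp / (tp + fn) if tp + fn else 0.0     # guard against no actual positives
    return precision, recall

y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 0]

precision, recall = precision_recall(y_true, y_pred)
print(precision, recall)
```

Here two of the three predicted positives are correct, while only two of the four actual positives were found.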

67) What is R-square and how is it calculated?

Once a machine learning model is built, it becomes necessary to evaluate the model to find out its efficiency.

R-square is an important evaluation metric used for linear regression problems. It measures the degree to which the variance in the dependent variable (the target) can be explained by the independent variables (the features) and is also known as the coefficient of determination. It is calculated as 1 minus the ratio of the residual sum of squares to the total sum of squares. For example, if the R-square value of a model is 0.8, it means 80% of the variation in the dependent variable is explained by the independent variables. In general, the higher the R-square, the better the model fits the data. For a linear regression with an intercept, evaluated on its training data, R-square lies between 0 and 100%.
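
R-square can be computed as 1 − SS_res / SS_tot, where SS_res is the sum of squared residuals and SS_tot is the sum of squared deviations of the target from its mean. A minimal sketch follows, with made-up observations and predictions.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # unexplained variation
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)             # total variation
    return 1 - ss_res / ss_tot

y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.5, 5.5, 7.0, 9.5]

print(r_squared(y_true, y_pred))
print(r_squared(y_true, [6.0] * 4))  # always predicting the mean explains nothing
```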



68) What is dataset Gold standard?

Gold standard or gold set refers to a diagnostic test or benchmark that can be accepted as the best available under reasonable conditions. It can also be defined as the most valid one and the one most used by researchers. For example, in medicine, doctors refer to blood assay as a gold standard for checking patients for medication adherence.


69) What is Ensemble Learning?

Ensemble learning basically combines a diverse set of learners (individual models) to improve the stability and predictive power of the model, thereby helping to improve its accuracy.

70) Describe in brief any type of Ensemble Learning?

Ensemble learning comes in many forms, but the two more popular ensemble learning techniques are described below.

Bagging trains similar learners on small sample populations and then takes a mean of all the predictions. This approach uses the same algorithm for every predictor (e.g. all decision trees), but trains each one on a different random subset of the training set, which gives a more generalised result. The subsets can be created by sampling either with replacement or without replacement.

Boosting, also known as hypothesis boosting, is an ensemble method which adjusts the weight of an observation based on the last classification. It is a sequential process where each subsequent model attempts to fix the errors of its predecessor: if an observation was classified incorrectly, its weight is increased, and vice versa. Boosting generally decreases the bias error and builds strong predictive models. However, boosted models may overfit the training data.
