SeekACE Academy · 1st Jul, 2021

DATA SCIENCE INTERVIEW PART-3

31) What are Interpolation and Extrapolation?

Interpolation is a very useful statistical and mathematical tool for estimating an unknown value that lies between known values in a list of data points. Investors and stock analysts, for example, often interpolate along line charts to estimate the price of a security between observed data points. Extrapolation is approximating a value by extending a known set of values or facts beyond the range of the data: it is the process of estimating what will happen if the present trend continues, and it is an important concept not only in Mathematics but also in other disciplines like Psychology, Sociology, Statistics etc.
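To make the distinction concrete, here is a minimal sketch using NumPy; the days and prices are made-up values for illustration, not real market data:

```python
import numpy as np

# Known values: price of a security sampled on days 0..4 (illustrative only)
days = np.array([0, 1, 2, 3, 4])
price = np.array([10.0, 12.0, 11.5, 13.0, 14.0])

# Interpolation: estimate a value *inside* the known range (day 2.5)
inside = np.interp(2.5, days, price)  # linear interpolation

# Extrapolation: estimate a value *beyond* the known range (day 6)
# by extending a fitted line past the last observation.
slope, intercept = np.polyfit(days, price, deg=1)
outside = slope * 6 + intercept

print(f"interpolated price at day 2.5: {inside:.2f}")
print(f"extrapolated price at day 6:   {outside:.2f}")
```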

32) What’s power analysis?

Statistical power analysis is an experimental design technique, usually conducted before data collection, for determining the probability of detecting an effect of a given size with a given level of confidence, under given sample-size constraints.
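As a hedged illustration, here is how such a calculation might look with statsmodels' `TTestIndPower` for a two-sample t-test; the effect size, alpha and power targets below are illustrative choices, not values implied by the question:

```python
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Required sample size per group to detect a medium effect (d = 0.5)
# at alpha = 0.05 with 80% power.
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.8)
print(f"required n per group: {n_per_group:.1f}")

# Conversely: the power achieved with a fixed sample of 50 per group.
achieved = analysis.solve_power(effect_size=0.5, nobs1=50, alpha=0.05)
print(f"power with n=50 per group: {achieved:.2f}")
```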

33)  What is K-means? How can you select K for K-means?

K-means clustering is an iterative algorithm that solves the clustering problem when we have unlabelled data. It aims to partition the observations into K clusters in which each observation belongs to the cluster with the nearest mean (the cluster centre or centroid), which serves as a prototype of the cluster. Its versatility makes it suitable for many types of grouping tasks.

You can choose the number of clusters visually, but there is a lot of ambiguity in doing so. In practice you experiment on the data set, compare the results for different values of K (for example with the elbow method or silhouette scores), and pick the one that works best, as in the sketch below.
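A minimal sketch of the elbow heuristic with scikit-learn, assuming synthetic blob data:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 4 true clusters (for illustration only)
X, _ = make_blobs(n_samples=300, centers=4, random_state=42)

# Fit K-means for a range of K and record the inertia
# (within-cluster sum of squared distances).
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    print(f"K={k}  inertia={km.inertia_:.1f}")

# Plot inertia against K and look for the "elbow" where the curve
# flattens; silhouette scores are another common criterion.
```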

34) How is K-NN different from K-means clustering?

Both K-NN and K-means may seem similar and are easy to confuse with one another, because both involve comparing the distance from a given input point to a set of stored data points, but they tackle different problems (a short sketch follows the list below):

  1. K-NN is a supervised machine learning technique, whereas K-means is an unsupervised machine learning technique.
  2. K-NN is a classification or regression algorithm, while K-means is a clustering algorithm.
  3. K-NN is a lazy learner, which means it has no real training phase, whereas K-means is an eager learner, meaning it has a model-fitting, i.e. training, step.
  4. K-NN performs much better when all of the features are on the same scale, and it is among the simplest ML algorithms; the same does not hold true for K-means.
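A minimal sketch of the contrast, assuming scikit-learn and synthetic data:

```python
from sklearn.datasets import make_blobs
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = make_blobs(n_samples=200, centers=3, random_state=0)

# K-NN: supervised -- fit() needs the labels y, then classifies new points.
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)
print("K-NN prediction:", knn.predict(X[:1]))

# K-means: unsupervised -- fit() sees only X and invents cluster labels.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("K-means cluster:", km.predict(X[:1]))
```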

35) What is cluster sampling?

Cluster sampling is a sampling plan used in statistics when natural groups are present in a population. This sampling plan divides the whole population into clusters or groups, and random samples are then collected from each group.

Cluster sampling is typically used in market research. It is mostly used when researchers cannot get enough information about the population as a whole but can get information about the clusters. For example, a researcher may be interested in data about city taxes in a state. The researcher would collect data from selected cities and then compile them to get a picture of the whole state; the individual cities, in this case, are the clusters. Cluster sampling is usually more economical or more practical than stratified sampling or simple random sampling.
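A minimal sketch of two-stage cluster sampling in NumPy; the "cities" and all group sizes are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Population organised into natural groups (e.g., cities as clusters).
clusters = {f"city_{i}": rng.normal(loc=i, scale=1.0, size=100)
            for i in range(10)}

# Step 1: randomly select a subset of clusters.
chosen = rng.choice(list(clusters), size=3, replace=False)

# Step 2: draw a random sample from each selected cluster.
sample = np.concatenate(
    [rng.choice(clusters[c], size=20, replace=False) for c in chosen]
)
print("selected clusters:", chosen, " sample mean:", sample.mean())
```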

36)  What is the difference between Cluster and Systematic Sampling?

Cluster sampling is the most economical and practical sampling technique when it is difficult to study a target population spread across a wide area and simple random sampling cannot be applied. The researcher divides the population into multiple clusters that are indicative of the population's homogeneous characteristics, and each cluster stands an equal chance of being part of the sample. Systematic sampling is a random statistical technique in which elements are selected from an ordered sampling frame: researchers calculate the sampling interval by dividing the entire population size by the desired sample size and then select every k-th element. The classic example of systematic sampling is the equal-probability method.
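A minimal sketch of systematic sampling in NumPy, with an invented population of 1,000 ordered elements:

```python
import numpy as np

rng = np.random.default_rng(1)
population = np.arange(1000)          # an ordered sampling frame
n = 50                                # desired sample size

interval = len(population) // n       # sampling interval k = N / n
start = rng.integers(interval)        # random starting point in [0, k)
sample = population[start::interval]  # every k-th element thereafter

print(f"interval={interval}, start={start}, sample size={len(sample)}")
```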

37) Explain the steps in making a decision tree.

1. Identify the points of decision, the type or range of alternative outcomes at each point, and the order in which the decisions must be made.

2. Calculate the values needed to make the analysis, especially the probabilities of different events or results of actions and the costs and gains of the various events and actions.

3. Analyse the alternative values to choose a course of action.

4. Construct a decision tree showing the sequence of decisions and chance events (a short code sketch follows these steps).
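In machine learning, libraries automate the value calculations and tree construction once the data are prepared. A hedged sketch with scikit-learn's `DecisionTreeClassifier` on the built-in iris data:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# The fitted tree encodes the decision points and the order
# in which they are evaluated.
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(iris.data, iris.target)

# Print the learned sequence of decisions as text.
print(export_text(tree, feature_names=iris.feature_names))
```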

38) What is a random forest model? State its advantages and disadvantages.

A random forest model is built from many decision trees, each trained on a slightly different set of observations, and the final prediction aggregates the predictions of the individual trees. If you split the data into different packages and build a decision tree on each group of data, the random forest brings all of those trees together.

A random forest is considered a highly accurate algorithm because its results are derived from building multiple decision trees. Since it aggregates the outputs of many trees, the variance of the results is reduced, which greatly lowers the risk of overfitting. It also provides a useful measure of feature importance, and it can be used to build both classification and regression models.

Despite its several advantages, a random forest model is comparatively slow at generating predictions, as it has to evaluate multiple decision trees.
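A minimal sketch, assuming scikit-learn, showing both the ensemble fit and the feature-importance output mentioned above; the iris data set is an arbitrary stand-in:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()

# Each of the 200 trees is trained on a bootstrap sample of the rows
# and considers a random subset of features at each split.
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(iris.data, iris.target)

# Aggregated impurity-based feature importances across all trees.
for name, imp in zip(iris.feature_names, rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```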

39) How can you avoid overfitting in your model?

Overfitting refers to a model that has been fitted so closely to the training data that it starts learning from the noise and inaccurate entries in the data set. Such a model does not generalise correctly, because it has absorbed too many details and too much noise. Overfitting can be reduced by implementing the following steps:

1. Increase the training data.

2. Keep the model simple: take fewer variables into consideration, thereby removing some of the noise in the training data.

3. Use regularisation techniques, like LASSO, that penalise model parameters likely to cause overfitting (a short sketch follows this list).

4. Use early stopping during the training phase: keep an eye on the validation loss over the training period and stop the moment it begins to increase.

5. Use dropout for neural networks to tackle overfitting.
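A minimal sketch of step 3, using scikit-learn's `Lasso` on synthetic data where only a few features are truly informative:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso

# Synthetic regression data: 20 features, only 3 carry real signal.
X, y = make_regression(n_samples=100, n_features=20, n_informative=3,
                       noise=5.0, random_state=0)

# The L1 penalty shrinks uninformative coefficients toward zero,
# which reduces the model's capacity to fit noise.
lasso = Lasso(alpha=1.0).fit(X, y)
print("non-zero coefficients:", (lasso.coef_ != 0).sum(), "of", X.shape[1])
```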

40) If your model gives 90% accuracy on the training data but only 60% on the test data, identify the problem in your model.

This can be termed a problem of overfitting, which refers to a model that models the training data too well. Overfitting is bad because the model learns the detail and noise in the training data and consequently makes poor predictions, which impairs the performance of the model on new data. The problem can be reduced by many methods, such as regularisation, cross-validation, ensembling and early stopping.
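As a hedged illustration of one remedy, cross-validation with scikit-learn; the dataset and model below are arbitrary stand-ins:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
model = DecisionTreeClassifier(random_state=0)

# 5-fold cross-validation: each fold is held out once as a test set.
# A large gap between training accuracy and these held-out scores
# is the overfitting symptom described above.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores.round(3))
print("mean CV accuracy:", scores.mean().round(3))
```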
