# Data Science Interview Questions and Answers

### 1. Between Python and R which one would you pick for text analytics and why?

Ans. For text analytics, Python gains the upper hand over R for the following reasons:

The pandas library in Python offers easy-to-use data structures and high-performance data analysis tools. Python generally performs faster for all types of text analytics. R is a better fit for statistical modeling and machine learning than for plain text analysis.
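As a small illustration of why pandas makes text work convenient, its vectorized `.str` string methods apply an operation to every row at once (the mini-corpus below is made up for the example):

```python
import pandas as pd

# Hypothetical mini-corpus of product reviews.
reviews = pd.Series([
    "Great product, fast shipping",
    "terrible PRODUCT, never again",
    "Fast delivery and great price",
])

lower = reviews.str.lower()                    # vectorized lowercasing
mentions_product = lower.str.contains("product")  # boolean mask per row
word_counts = lower.str.split().str.len()      # words per review

print(mentions_product.tolist())  # [True, True, False]
print(word_counts.tolist())       # [4, 4, 5]
```

The same cleanup in base R would typically need `sapply` calls or an extra package such as stringr.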

### 2. Explain Eigenvectors and Eigenvalues.

Ans. Eigenvectors help in understanding linear transformations. In data analysis, they are typically calculated for a correlation or covariance matrix. Eigenvalues can be understood as the strengths of the transformation in the direction of the eigenvectors, or as the factors by which the stretching or compression happens.
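For a 2x2 symmetric matrix (the shape a covariance or correlation matrix takes for two variables), the eigenvalues follow directly from the trace and determinant, which makes the "stretch factor" interpretation concrete. A minimal pure-Python sketch, with made-up matrix values:

```python
import math

def eig_2x2_symmetric(a, b, c):
    """Eigenvalues/eigenvectors of the symmetric matrix [[a, b], [b, c]].

    The characteristic polynomial of a 2x2 matrix is quadratic, so the
    eigenvalues come straight from the trace and determinant.
    Assumes b != 0 (a non-diagonal matrix).
    """
    tr, det = a + c, a * c - b * b
    disc = math.sqrt(tr * tr - 4 * det)  # always real for a symmetric matrix
    lam1, lam2 = (tr + disc) / 2, (tr - disc) / 2

    def vec(lam):
        # (A - lam*I) v = 0  =>  (a - lam) x + b y = 0; pick x = b, y = lam - a.
        x, y = b, lam - a
        n = math.hypot(x, y)
        return (x / n, y / n)

    return (lam1, vec(lam1)), (lam2, vec(lam2))

# A correlation-style matrix: the eigenvalues 1.8 and 0.2 are the factors by
# which the transformation stretches along each eigenvector direction.
(l1, v1), (l2, v2) = eig_2x2_symmetric(1.0, 0.8, 1.0)
print(round(l1, 6), round(l2, 6))  # 1.8 0.2
```

In practice one would call `numpy.linalg.eig` instead; the formula above is only to show where the numbers come from.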

### 3. What do you understand by linear & logistic regression?

Ans. Linear regression is a statistical technique in which the score of a variable y is predicted on the basis of the score of a second variable x, referred to as the predictor variable. Logistic regression is a statistical technique for predicting binary outcomes from a linear combination of predictor variables.
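The contrast can be made concrete in a few lines: ordinary least squares fits the line directly from means and cross-products, while logistic regression passes the same kind of linear score through a sigmoid to get a probability. A minimal stdlib sketch with toy data:

```python
import math

def fit_line(xs, ys):
    # Ordinary least squares estimates for y = slope * x + intercept.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx

def logistic(z):
    # Logistic (sigmoid) link: squeezes a linear score into a probability.
    return 1 / (1 + math.exp(-z))

slope, intercept = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(slope, intercept)  # 2.0 1.0 (the toy data lie exactly on y = 2x + 1)
print(logistic(0.0))     # 0.5: a linear score of 0 maps to probability 0.5
```

Fitting the coefficients of a real logistic model requires an iterative optimizer (e.g. scikit-learn's `LogisticRegression`); the sketch only shows the link function that makes the outcome binary-friendly.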

### 4. What do you know about Autoencoders?

Ans. Autoencoders are simple learning networks used for transforming inputs into outputs with the minimum possible error. An autoencoder receives unlabeled input, which it encodes into a compressed representation and then decodes to reconstruct the output.

### 5. Explain the concept of the Boltzmann machine.

Ans. A Boltzmann machine features a simple learning algorithm that enables it to discover interesting features representing complex regularities present in the training data. The learning algorithm, however, becomes slow in networks with many layers of feature detectors.

### 6. What are the skills required of a data scientist that could help in using Python for data analysis purposes?

Ans. The skills required as a data scientist that could help in using Python for data analysis are:

• Expertise in pandas DataFrames, scikit-learn, and N-dimensional NumPy arrays
• The ability to apply element-wise vector and matrix operations on NumPy arrays
• An understanding of built-in data types: tuples, sets, dictionaries, and others
• Knowledge of writing Python scripts and optimizing bottlenecks
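The element-wise operations mentioned above are what distinguish NumPy from plain Python loops: an arithmetic expression applies to every element at once, and scalars broadcast across whole arrays (the numbers below are arbitrary):

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])
prod = a * b                   # element-wise product, no explicit loop

m = np.arange(6).reshape(2, 3) # a 2x3 matrix [[0, 1, 2], [3, 4, 5]]
shifted = m + 100              # the scalar broadcasts over every element

print(prod.tolist())           # [10.0, 40.0, 90.0]
print(int(shifted.sum()))      # 615 = (0 + 1 + ... + 5) + 6 * 100
```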

### 7. What is the full form of GAN? Explain GAN.

Ans. The full form of GAN is Generative Adversarial Network. Its task is to take inputs from a noise vector, pass them to the generator to produce fake samples, and then to the discriminator, which identifies and differentiates fake inputs from unique (real) ones.

### 8. What are the vital components of GAN?

Ans. There are two vital components of a GAN:

1. Generator: The generator acts as a forger, which creates fake copies.

2. Discriminator: The discriminator acts as a recognizer of fake and unique (real) copies.

### 9. What is a computational graph?

Ans. A computational graph is the graphical representation that TensorFlow is based on. It is a wide network of different kinds of nodes, wherein each node represents a particular mathematical operation. The edges between these nodes are called tensors. This is the reason a computational graph is described as a flow of tensors through operations.
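The idea can be demonstrated without any framework: build nodes that each hold one operation, wire them together by edges, and evaluate the graph by pulling values through it. A toy sketch (class names are invented for illustration, not TensorFlow API):

```python
class Node:
    """An operation node: applies `op` to the values of its parent nodes."""
    def __init__(self, op, *parents):
        self.op, self.parents = op, parents

    def forward(self, **inputs):
        vals = [p.forward(**inputs) for p in self.parents]
        return self.op(*vals)

class Input(Node):
    """A leaf node: reads its value from the inputs fed to the graph."""
    def __init__(self, name):
        self.name = name

    def forward(self, **inputs):
        return inputs[self.name]

# Build the graph for the expression x * y + y.
x, y = Input("x"), Input("y")
graph = Node(lambda a, b: a + b, Node(lambda a, b: a * b, x, y), y)

print(graph.forward(x=3, y=4))  # 16, i.e. 3 * 4 + 4
```

Frameworks like TensorFlow add automatic differentiation and hardware scheduling on top of exactly this structure.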

### 10. What are Tensors?

Ans. Tensors are mathematical objects that represent collections of higher-dimensional data inputs, in the form of alphabets and numerals, fed as inputs to a neural network.
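A tensor's rank is simply how many axes it has, which can be illustrated with nested Python lists (a rough sketch; it assumes non-empty, non-ragged nesting):

```python
def shape(t):
    # A bare number is a rank-0 tensor (scalar); otherwise recurse along
    # the first axis and prepend its length.
    if not isinstance(t, list):
        return ()
    return (len(t),) + shape(t[0])

print(shape(5))                        # ()      scalar, rank 0
print(shape([1, 2, 3]))                # (3,)    vector, rank 1
print(shape([[1, 2, 3], [4, 5, 6]]))   # (2, 3)  matrix, rank 2
```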

### 11. What is Batch Normalization in Data Science?

Ans. Batch normalization in Data Science is a technique through which attempts are made to improve the performance and stability of a neural network. This is done by normalizing the inputs to each layer so that the mean output activation remains 0 with a standard deviation of 1.
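The normalization itself is just a mean-subtraction and division by the standard deviation over the batch; a minimal stdlib sketch (real batch-norm layers additionally learn a scale and shift per channel, omitted here):

```python
import math

def batch_norm(batch, eps=1e-5):
    # Normalize a batch of activations to mean 0 and standard deviation 1.
    # eps guards against division by zero when the batch is constant.
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n
    return [(x - mean) / math.sqrt(var + eps) for x in batch]

out = batch_norm([2.0, 4.0, 6.0, 8.0])
print(round(sum(out) / len(out), 6))  # mean of the normalized batch is ~0
```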

### 12. What is dropout in Data Science?

Ans. Dropout is a tool in Data Science used for dropping out hidden and visible units of a network on a random basis. It prevents overfitting of the data by dropping as much as 20% of the nodes, freeing the capacity needed for the iterations required for the network to converge.
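The mechanism fits in one function: zero each unit with the given probability and rescale the survivors so the expected activation is unchanged (the "inverted dropout" convention used by most frameworks). A stdlib sketch:

```python
import random

def dropout(activations, rate=0.2, rng=None):
    # Zero each unit with probability `rate`; scale survivors by 1/(1 - rate)
    # so the expected value of each activation stays the same.
    rng = rng or random.Random()
    keep = 1 - rate
    return [x / keep if rng.random() < keep else 0.0 for x in activations]

# Seeded for reproducibility: each unit is either dropped (0.0) or scaled to 1.25.
out = dropout([1.0] * 10, rate=0.2, rng=random.Random(0))
print(out)
```

At inference time dropout is switched off entirely; the rescaling during training is what makes that switch seamless.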

### 13. What are the different types of Deep learning frameworks?

Ans. The different types of Deep Learning frameworks include the following:

1. Caffe
2. Keras
3. TensorFlow
4. PyTorch
5. Chainer
6. Microsoft Cognitive Toolkit

### 14. What are the various machine learning libraries and their benefits?

Ans. The various machine learning libraries and their benefits are as follows:

1. NumPy: Used for scientific computations.
2. Statsmodels: Used for time-series analysis.
3. Pandas: Used for tabular data analysis.
4. TensorFlow: Used for deep learning.
5. NLTK: Used for text processing.

### 15. What is the full form of LSTM? What is the function of LSTM?

Ans. LSTM stands for Long Short-Term Memory. It is a recurrent neural network that is capable of learning long-term dependencies and retaining information for longer periods as part of its default behavior.


### 17. What are the feature selection methods used to select the right variables?

Ans. There are two main methods for feature selection: filter methods and wrapper methods.

#### Filter Methods

This involves:

• Linear discrimination analysis
• ANOVA
• Chi-Square

The best analogy for selecting features is “bad data in, bad answer out.” When we’re limiting or selecting the features, it’s all about cleaning up the data coming in.

#### Wrapper Methods

This involves:

• Forward Selection: We test one feature at a time and keep adding them until we get a good fit
• Backward Selection: We test all the features and start removing them to see what works better
• Recursive Feature Elimination: Recursively looks through all the different features and how they pair together

Wrapper methods are very labor-intensive, and high-end computers are needed if a lot of data analysis is performed with the wrapper method.
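Forward selection, the first wrapper method above, can be sketched as a greedy loop: keep adding whichever remaining feature most improves the score. The `score` callable below is a hypothetical stand-in for what would be cross-validated model accuracy in practice, and the feature names and weights are made up:

```python
def forward_selection(features, score):
    """Greedy forward selection: repeatedly add the feature that most
    improves `score` (higher is better); stop when nothing helps."""
    selected, best = [], score([])
    while True:
        gains = [(score(selected + [f]), f) for f in features if f not in selected]
        if not gains:
            break
        top, f = max(gains)
        if top <= best:
            break  # no remaining feature improves the fit
        selected, best = selected + [f], top
    return selected

# Toy scoring function: only "x1" and "x3" carry any signal.
useful = {"x1": 0.3, "x2": 0.0, "x3": 0.2}
selected_features = forward_selection(list(useful),
                                      lambda s: sum(useful[f] for f in s))
print(selected_features)  # ['x1', 'x3'] — the uninformative 'x2' is skipped
```

The labor-intensity mentioned above is visible in the loop: every candidate step refits (rescores) the model once per remaining feature.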

### 18. For the given points, how will you calculate the Euclidean distance in Python?

Ans. With the two points stored as coordinate lists, the distance follows from the Pythagorean theorem:

```python
import math

plot1 = [1, 3]
plot2 = [2, 5]

# Square root of the sum of squared coordinate differences.
euclidean_distance = math.sqrt((plot1[0] - plot2[0]) ** 2 + (plot1[1] - plot2[1]) ** 2)
print(euclidean_distance)  # 2.23606797749979, i.e. sqrt(5)
```


### 19. What are dimensionality reduction and its benefits?

Ans. Dimensionality reduction refers to the process of converting a data set with vast dimensions into data with fewer dimensions (fields) in order to convey similar information concisely.

This reduction helps in compressing data and reducing storage space. It also reduces computation time as fewer dimensions lead to less computing. It removes redundant features; for example, there’s no point in storing a value in two different units (meters and inches).

### 20. How should you maintain a deployed model?

Ans. The steps to maintain a deployed model are:

#### Monitor

Constant monitoring of all models is needed to determine their performance accuracy. When you change something, you want to figure out how your changes are going to affect things. This needs to be monitored to ensure it’s doing what it’s supposed to do.

#### Evaluate

Evaluation metrics of the current model are calculated to determine if a new algorithm is needed.

#### Compare

The new models are compared to each other to determine which model performs the best.

#### Rebuild

The best performing model is re-built on the current state of data.

### 21. What are recommender systems?

Ans. A recommender system predicts how a user would rate a specific product based on their preferences. It can be split into two different areas:

#### Collaborative Filtering

As an example, Last.fm recommends tracks that other users with similar interests play often. This is also commonly seen on Amazon after making a purchase; customers may notice the following message accompanied by product recommendations: “Users who bought this also bought…”

#### Content-based Filtering

As an example: Pandora uses the properties of a song to recommend music with similar properties. Here, we look at content, instead of looking at who else is listening to music.
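The collaborative-filtering side can be sketched with cosine similarity between user rating vectors: recommend to a user based on whoever rates items most similarly. The users and ratings below are invented for illustration:

```python
import math

def cosine(u, v):
    # Cosine similarity of two rating vectors (0 means "not rated").
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# Hypothetical users x items rating matrix.
ratings = {
    "ann": [5, 4, 0, 1],
    "bob": [4, 5, 0, 2],
    "eve": [1, 0, 5, 4],
}

def most_similar(user):
    # The neighbor whose tastes are closest; their highly rated but
    # unseen items would be the recommendations.
    others = [(cosine(ratings[user], ratings[o]), o) for o in ratings if o != user]
    return max(others)[1]

print(most_similar("ann"))  # bob — their rating vectors point the same way
```

Content-based filtering would instead compare item feature vectors (e.g. a song's properties), using the same similarity machinery.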

### 22. After studying the behavior of a population, you have identified four specific individual types that are valuable to your study. You would like to find all users who are most similar to each type. Which algorithm is most appropriate for this study?

Ans. Choose the correct option:

1. K-means clustering
2. Linear regression
3. Association rules
4. Decision trees

Since we are looking to group people by four specific types of similarity, the number of groups indicates the value of k (k = 4). Therefore, K-means clustering (answer A) is the most appropriate algorithm for this study.
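K-means itself is a short alternating loop (Lloyd's algorithm): assign each point to its nearest center, then move each center to the mean of its points. A 1-D stdlib sketch with toy data (real studies would use scikit-learn's `KMeans` on multi-dimensional features):

```python
import statistics

def kmeans_1d(points, centers, iters=10):
    # Lloyd's algorithm on 1-D data: assign points to the nearest center,
    # then recompute each center as the mean of its assigned points.
    for _ in range(iters):
        clusters = {c: [] for c in centers}
        for p in points:
            nearest = min(centers, key=lambda c: abs(c - p))
            clusters[nearest].append(p)
        centers = [statistics.mean(v) if v else c for c, v in clusters.items()]
    return sorted(centers)

result = kmeans_1d([1, 2, 3, 10, 11, 12], [0.0, 5.0])
print(result)  # [2, 11] — the two cluster centers the data settle into
```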

### 23. You have run the association rules algorithm on your dataset, and the two rules {banana, apple} => {grape} and {apple, orange} => {grape} have been found to be relevant. What else must be true?

1. {banana, apple, grape, orange} must be a frequent itemset
2. {banana, apple} => {orange} must be a relevant rule
3. {grape} => {banana, apple} must be a relevant rule
4. {grape, apple} must be a frequent itemset

Ans. The answer is option 4: {grape, apple} must be a frequent itemset. For the rule {banana, apple} => {grape} to be relevant, the itemset {banana, apple, grape} must be frequent, and by the Apriori property every subset of a frequent itemset — including {grape, apple} — is itself frequent.
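The Apriori property can be checked numerically: an itemset's support (the fraction of transactions containing it) can never be lower than that of any superset. The transactions below are invented to be consistent with the two rules:

```python
# Hypothetical transaction data consistent with the two relevant rules.
transactions = [
    {"banana", "apple", "grape"},
    {"apple", "orange", "grape"},
    {"banana", "apple", "grape", "orange"},
    {"apple", "grape"},
]

def support(itemset):
    # Fraction of transactions that contain every item in `itemset`.
    return sum(itemset <= t for t in transactions) / len(transactions)

# A subset's support is always >= its superset's support, so if
# {banana, apple, grape} is frequent, {grape, apple} must be too.
print(support({"banana", "apple", "grape"}))  # 0.5
print(support({"grape", "apple"}))            # 1.0
```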

### 24. Why is R used in Data Visualization?

Ans. R is widely used in Data Visualizations for the following reasons-

• We can create almost any type of graph using R.
• R has multiple libraries like lattice, ggplot2, leaflet, etc., and so many inbuilt functions as well.
• It is easier to customize graphics in R compared to Python.
• R is used in feature engineering and in exploratory data analysis as well.

### 25. What does NLP stand for?

Ans. NLP is short for Natural Language Processing. It deals with the study of how computers analyze massive amounts of textual data through programming. A few popular examples of NLP techniques are stemming, sentiment analysis, tokenization, and the removal of stop words.
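Two of the techniques named above — tokenization and stop-word removal — need nothing beyond the standard library (the stop-word list below is a tiny illustrative subset, not a standard one):

```python
import re
from collections import Counter

STOPWORDS = {"the", "a", "is", "of", "and"}  # tiny illustrative list

def tokenize(text):
    # Lowercase word tokenization via regex, then stop-word removal.
    tokens = re.findall(r"[a-z']+", text.lower())
    return [t for t in tokens if t not in STOPWORDS]

tokens = tokenize("The study of the computers and language is NLP.")
print(tokens)  # ['study', 'computers', 'language', 'nlp']
print(Counter(tokens).most_common(1))  # most frequent remaining token
```

Libraries such as NLTK provide curated stop-word lists and stemmers (e.g. the Porter stemmer) that build on the same pipeline.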