Linear Discriminant Analysis is a very popular Machine Learning technique that is used to solve classification problems. In this article we will try to understand the intuition and mathematics behind this technique. An example of implementation of LDA in R is also provided.

- Linear Discriminant Analysis Assumption
- Intuitions
- Mathematical Description of LDA
- Learning the Model Parameters
- Example in R

**Linear Discriminant Analysis Assumption**

Linear Discriminant Analysis is based on the following assumptions:

The dependent variable

*Y*is discrete. In this article we will assume that the dependent variable is binary and takes class values*{+1, -1}*. The probability of a sample belonging to class*+1*, i.e*P(Y = +1) = p*. Therefore, the probability of a sample belonging to class*-1*is*1-p*.The independent variable(s)

*X*come from gaussian distributions. The mean of the gaussian distribution depends on the class label*Y*. i.e. if*Y**i**= +1*, then the mean of*X**i*is*𝜇**+1*, else it is*𝜇**-1*. The variance*𝜎**2*is the same for both classes. Mathematically speaking,*X|(Y = +1) ~ N(𝜇**+1**, 𝜎**2**)*and*X|(Y = -1) ~ N(𝜇**-1**, 𝜎**2**)*, where*N*denotes the normal distribution.

With this information it is possible to construct a joint distribution *P(X,Y)* for the independent and dependent variable. Therefore, LDA belongs to the class of **Generative Classifier Models**. A closely related generative classifier is Quadratic Discriminant Analysis(QDA). It is based on all the same assumptions of LDA, except that the class variances are different.

Let us continue with Linear Discriminant Analysis article and see

**Intuition**

Consider the class conditional gaussian distributions for *X* given the class *Y*. The below figure shows the density functions of the distributions. In this figure, if *Y = +1*, then the mean of *X* is 10 and if *Y = -1*, the mean is 2. The variance is 2 in both cases.

Now suppose a new value of *X* is given to us. Lets just denote it as *x**i*. The task is to determine the most likely class label for this *x**i*, i.e. *y**i*. For simplicity assume that the probability *p* of the sample belonging to class *+1* is the same as that of belonging to class *-1*, i.e. *p=0.5*.

Intuitively, it makes sense to say that if *x**i* is closer to *𝜇**+1* than it is to *𝜇**-1*, then it is more likely that *y**i** = +1*. More formally, *y**i** = +1* if:

*|x**i **– 𝜇**+1**| < |x**i **– 𝜇**-1**|*

Normalizing both sides by the standard deviation:

*|x**i **– 𝜇**+1**|/𝜎 < |x**i **– 𝜇**-1**|/𝜎*

Squaring both sides:

*(x**i **– 𝜇**+1**)**2**/𝜎**2** < (x**i **– 𝜇**-1**)**2**/𝜎**2*

*x**i**2**/𝜎**2** + 𝜇**+1**2**/𝜎**2** – 2 x**i**𝜇**+1**/𝜎**2** < x**i**2**/𝜎**2** + 𝜇**-1**2**/𝜎**2** – 2 x**i**𝜇**-1**/𝜎**2** *

*2 x**i** (𝜇**-1** – 𝜇**+1**)/𝜎**2** – (𝜇**-1**2**/𝜎**2** – 𝜇**+1**2**/𝜎**2**) < 0*

*-2 x**i** (𝜇**-1** – 𝜇**+1**)/𝜎**2** + (𝜇**-1**2**/𝜎**2** – 𝜇**+1**2**/𝜎**2**) > 0*

The above expression is of the form *bx**i** + c > 0* where *b = -2(𝜇**-1** – 𝜇**+1**)/𝜎**2* and *c = (𝜇**-1**2**/𝜎**2** – 𝜇**+1**2**/𝜎**2**)*.

It is apparent that the form of the equation is **linear**, hence the name Linear Discriminant Analysis.

Let us continue with Linear Discriminant Analysis article and see,

**Mathematical Description of LDA**

The mathematical derivation of the expression for LDA is based on concepts like Bayes Rule and **Bayes Optimal Classifier**. Interested readers are encouraged to read more about these concepts. One way to derive the expression can be found here.

We will provide the expression directly for our specific case where *Y* takes two classes *{+1, -1}*. We will also extend the intuition shown in the previous section to the general case where *X* can be multidimensional. Let’s say that there are *k* independent variables. In this case, the class means *𝜇**-1* and *𝜇**+1* would be vectors of dimensions *k*1* and the variance-covariance matrix *𝜮* would be a matrix of dimensions *k*k*.

The classifier function is given as

*Y = h(X) = sign(b**T**X + c)*

Where,

*b = -2 𝜮 **-1**(𝜇**-1** – 𝜇**+1**)*

*c = 𝜇**-1**T**𝜮 **-1**𝜇**-1** – 𝜇**-1**T**𝜮 **-1**𝜇**-1** -2 ln{(1-p)/p}*

The sign function returns *+1* if the expression *b**T**x + c > 0*, otherwise it returns *-1*. The natural log term in *c* is present to adjust for the fact that the class probabilities need not be equal for both the classes, i.e. *p* could be any value between (0, 1), and not just 0.5.

**Learning the Model Parameters**

Given a dataset with *N* data-points *(x**1**, y**1**), (x**2**, y**2**), … (x**n**, y**n**)*, we need to estimate *p, 𝜇**-1**, 𝜇**+1* and *𝜮*. A statistical estimation technique called Maximum Likelihood Estimation is used to estimate these parameters. The expressions for the above parameters are given below.

*𝜇**+1**= (1/N**+1**) * *𝚺*i:yi=+1 **x**i*

*𝜇**-1** = (1/N**-1**) * *𝚺*i:yi=-1 **x**i*

*p = N**+1**/N*

*𝜮 = (1/N) * *𝚺i*=1:N**(x**i** – 𝜇**i**)(x**i** – 𝜇**i**)**T*

Where *N**+1** = number of samples where y**i** = +1* and *N**-1** = number of samples where y**i** = -1*.

With the above expressions, the LDA model is complete. One can estimate the model parameters using the above expressions and use them in the classifier function to get the class label of any new input value of independent variable *X*.

Let us continue with Linear Discriminant Analysis article and see

**Example in R**

The following code generates a dummy data set with two independent variables *X1* and *X2* and a dependent variable *Y*. For *X1* and *X2*, we will generate sample from two multivariate gaussian distributions with means *𝜇**-1**= (2, 2)* and *𝜇**+1**= (6, 6)*. 40% of the samples belong to class *+1* and 60% belong to class *-1*, therefore *p = 0.4*.

library(ggplot2) library(MASS) library(mvtnorm) #Variance Covariance matrix for random bivariate gaussian sample var_covar = matrix(data = c(1.5, 0.3, 0.3, 1.5), nrow=2) #Random bivariate gaussian samples for class +1 Xplus1 <- rmvnorm(400, mean = c(6, 6), sigma = var_covar) # Random bivariate gaussian samples for class -1 Xminus1 <- rmvnorm(600, mean = c(2, 2), sigma = var_covar) #Samples for the dependent variable Y_samples <- c(rep(1, 400), rep(-1, 600)) #Combining the independent and dependent variables into a dataframe dataset <- as.data.frame(cbind(rbind(Xplus1, Xminus1), Y_samples)) colnames(dataset) <- c("X1", "X2", "Y") dataset$Y <- as.character(dataset$Y) #Plot the above samples and color by class labels ggplot(data = dataset)+ geom_point(aes(X1, X2, color = Y))

In the above figure, the blue dots represent samples from class *+1* and the red ones represent the sample from class *-1*. There is some overlap between the samples, i.e. the classes cannot be separated completely with a simple line. In other words they are not perfectly **linearly separable**.

We will now train a LDA model using the above data.

#Train the LDA model using the above dataset lda_model <- lda(Y ~ X1 + X2, data = dataset) #Print the LDA model lda_model

**Output:**

Prior probabilities of groups:

-1 1

0.6 0.4

Group means:

X1 X2

-1 1.928108 2.010226

1 5.961004 6.015438

Coefficients of linear discriminants:

LD1

X1 0.5646116

X2 0.5004175

As one can see, the class means learnt by the model are (1.928108, 2.010226) for class *-1* and (5.961004, 6.015438) for class *+1*. These means are very close to the class means we had used to generate these random samples. The prior probability for group *+1* is the estimate for the parameter *p*. The *b* vector is the linear discriminant coefficients.

We will now use the above model to predict the class labels for the same data.

#Predicting the class for each sample in the above dataset using the LDA model y_pred <- predict(lda_model, newdata = dataset)$class #Adding the predictions as another column in the dataframe dataset$Y_lda_prediction <- as.character(y_pred) #Plot the above samples and color by actual and predicted class labels dataset$Y_actual_pred <- paste(dataset$Y, dataset$Y_lda_prediction, sep=",") ggplot(data = dataset)+ geom_point(aes(X1, X2, color = Y_actual_pred))</p>

In the above figure, the purple samples are from class *+1* that were classified correctly by the LDA model. Similarly, the red samples are from class *-1* that were classified correctly. The blue ones are from class *+1* but were classified incorrectly as *-1*. The green ones are from class *-1* which were misclassified as *+1*. The misclassifications are happening because these samples are closer to the other class mean (centre) than their actual class mean.

*This brings us to the end of this article, check out the R training*

**by Edureka, a trusted online learning company with a network of more than 250,000 satisfied learners spread across the globe. Edureka’s Data Analytics with R training will help you gain expertise in R Programming, Data Manipulation, Exploratory Data Analysis, Data Visualization, Data Mining, Regression, Sentiment Analysis and using R Studio for real life case studies on Retail, Social Media.***Got a question for us? Please mention it in the comments section of this article and we will get back to you as soon as possible.*