What is Exploratory Data Analysis?
It is an approach to analyzing data sets to summarize their main characteristics, often with visual methods. It is most useful for identifying outliers, trends and patterns. It was promoted by John Tukey to encourage statisticians to explore the data. Tukey is also credited with coining the term “bit”, with an early published use of the word “software”, and with extending the “jackknife” method.
Let’s assume that the user wishes to find the time at which employees can be expected to arrive at the office.
The factors can be weather, driving conditions, distance, family setup, quality of roads, past absenteeism and so on. Let’s assume that the arrival time at the office does not depend on these factors; this assumption is termed the Null Hypothesis.
The first step in Exploratory Data Analysis is defining the Null Hypothesis and then running the model in order to calculate the p value of each factor. We then reject, or fail to reject, the Null Hypothesis based on the p value.
Let’s say that out of 7 variables, the p values of 5 indicate that we can reject the Null Hypothesis. This means that those 5 factors influence the dependent variable.
Also note that a lower p value gives stronger evidence against the Null Hypothesis. If the user wants to be 95% confident of the results, the p value must be below 0.05 (p < 0.05).
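As a rough illustration of this p value check, the sketch below fits a linear model and lists the factors whose p value falls below 0.05. The office arrival data is not available, so R's built-in mtcars data is used as a stand-in; the choice of mpg as the dependent variable and of wt, hp and qsec as factors is an assumption made only for this example.

```r
# Stand-in example: the article's office-arrival data is not available,
# so the built-in mtcars data is used instead (an assumption).
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: one row per term, p values in the "Pr(>|t|)" column.
coefs <- summary(fit)$coefficients
p_values <- coefs[rownames(coefs) != "(Intercept)", "Pr(>|t|)"]

alpha <- 0.05  # threshold matching 95% confidence
significant <- names(p_values)[p_values < alpha]

print(round(p_values, 4))
print(significant)  # factors for which the Null Hypothesis is rejected
```

Factors that do not appear in `significant` are those for which we fail to reject the Null Hypothesis at the 95% level.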
Data: mtcars Example
Step 1. We first need to check the number of observations and columns. We run str (structure) on the data, that is, str(mtcars).
This gives output showing the structure of the data.
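A minimal sketch of Step 1, assuming the data is R's built-in mtcars data frame; the variable names n_obs and n_cols are introduced here only for illustration:

```r
# mtcars ships with R, so nothing needs to be loaded first.
str(mtcars)               # one line per column: name, type, first few values

n_obs  <- nrow(mtcars)    # number of observations (rows)
n_cols <- ncol(mtcars)    # number of columns (variables)
cat("observations:", n_obs, " columns:", n_cols, "\n")
```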
Step 2. We check how the first five rows of the data look (they are available in the output).
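One way to look at the first five rows directly is head(), sketched here under the assumption that the data is the built-in mtcars data frame:

```r
# head() returns the first n rows of a data frame (n = 5 here).
first_rows <- head(mtcars, 5)
print(first_rows)
```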
Step 3. We then summarize a variable, for example with summary(mtcars$mpg).
This statement gives the minimum and maximum values, along with the quartiles and the mean.
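The article does not say which variable is summarized, so the sketch below assumes mpg from mtcars. summary() reports the minimum and maximum together with the quartiles and mean, while range() returns only the two extremes.

```r
# Assumption: mpg is the variable of interest; the article does not name it.
mpg <- mtcars$mpg

print(summary(mpg))       # Min, 1st Qu., Median, Mean, 3rd Qu., Max

mpg_range <- range(mpg)   # c(minimum, maximum)
print(mpg_range)
```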
The hist() statement draws the histogram of the values. It groups the data into 10 buckets and shows the frequency of each bucket, including the highest frequency.
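A hedged sketch of the histogram step, again assuming mpg from mtcars. Note that the breaks argument of hist() is only a suggestion, so R may choose a slightly different number of buckets than the one requested.

```r
# Assumption: mpg is the variable being explored; the article does not name it.
h <- hist(mtcars$mpg,
          breaks = 10,   # a suggestion only; R picks "pretty" break points
          main = "Histogram of mpg",
          xlab = "Miles per gallon")

print(h$breaks)                      # bucket boundaries
print(h$counts)                      # frequency in each bucket
max_count <- max(h$counts)           # highest frequency
print(h$mids[which.max(h$counts)])   # midpoint of the fullest bucket
```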
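The interquartile range is the difference between the 75th percentile value and the 25th percentile value. In R it is computed with the IQR() function.
The interquartile range is used to find outliers: values lying more than 1.5 times the interquartile range below the 25th percentile or above the 75th percentile are commonly flagged.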
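The sketch below shows one common way to use the interquartile range for outlier detection, the 1.5 × IQR rule. The rule and the choice of mpg from mtcars are assumptions for illustration; the article does not state which rule or variable it used.

```r
x <- mtcars$mpg                       # assumed variable of interest

q   <- quantile(x, c(0.25, 0.75))     # 25th and 75th percentile values
iqr <- IQR(x)                         # same as q[2] - q[1]

# Fences of the 1.5 * IQR rule (a convention, not stated in the article).
lower_fence <- q[1] - 1.5 * iqr
upper_fence <- q[2] + 1.5 * iqr

outliers <- x[x < lower_fence | x > upper_fence]
print(iqr)
print(outliers)
```

boxplot() builds its fences from the hinges returned by fivenum() rather than from quantile(), so the points it flags can differ slightly from this calculation.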
Step 4. We then draw a box plot with the boxplot() statement.
Here, we see a relationship between the fivenum() command and the box plot. fivenum() gives 5 numbers (minimum value, 25th percentile value, median value, 75th percentile value and maximum value), and these are the values the box plot draws.
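To make the link between the two concrete, the sketch below compares the five numbers with the statistics that boxplot() returns. It assumes mpg from mtcars as the plotted variable.

```r
x <- mtcars$mpg                  # assumed variable of interest

five <- fivenum(x)               # minimum, lower hinge, median, upper hinge, maximum

# boxplot() draws the plot and invisibly returns the statistics it used.
bp <- boxplot(x, main = "mpg", ylab = "Miles per gallon")

print(five)
print(bp$stats[, 1])             # lower whisker, lower hinge, median, upper hinge, upper whisker
```

The hinges and the median always agree between the two. The whisker ends equal the minimum and maximum only when the plot flags no outliers.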
The link: www.flowingdata.com/2008/02/15/how-to-read-and-use-a-box-and-whisker-plot will help in reading the box plot.
We then use box plots to compare quantities or variables. We run str on the iris dataset with the command str(iris).
This shows the columns “Sepal.Length”, “Sepal.Width”, “Petal.Length”, “Petal.Width” and “Species”. The data set describes iris flowers. Once we draw the box plot of sepal length by species, we can read several things from it.
The insights from the diagram are:
- There is an outlier in virginica
- The circle below the virginica box is that outlier
- The maximum sepal length of setosa is less than the median sepal length of versicolor
- Virginica has the maximum sepal length and setosa has the minimum sepal length
- Virginica has the widest range between its minimum and maximum values
The command line is as follows:
boxplot(iris$Sepal.Length ~ iris$Species)
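The insights listed above can also be checked numerically. The sketch below draws the same plot using the formula interface with a data argument (an equivalent spelling, chosen here for readability) and then inspects the values behind it; the helper names are illustrative only.

```r
# Same plot as above; boxplot() invisibly returns the statistics it drew.
bp <- boxplot(Sepal.Length ~ Species, data = iris)

# Outliers flagged by the plot and the species they belong to.
outlier_species <- bp$names[bp$group]
print(data.frame(species = outlier_species, value = bp$out))

# Minimum, median and maximum sepal length per species.
by_species <- sapply(split(iris$Sepal.Length, iris$Species),
                     function(v) c(min = min(v), median = median(v), max = max(v)))
print(by_species)

setosa_max        <- by_species["max", "setosa"]
versicolor_median <- by_species["median", "versicolor"]
```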
We then give the plot a name and change the xlab, ylab and colours. We can also change the background, for example with the bg setting of par().
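A sketch of these cosmetic changes, assuming the iris box plot from earlier. The title, axis labels and colour names are placeholders, and par(bg = ...) is one standard way to set the background in base R graphics; the article's original command is not shown.

```r
# Set the background colour and remember the previous setting.
old_par <- par(bg = "lightyellow")

boxplot(Sepal.Length ~ Species, data = iris,
        main = "Sepal length by species",    # plot name
        xlab = "Species",                    # x-axis label
        ylab = "Sepal length (cm)",          # y-axis label
        col  = c("tomato", "skyblue", "palegreen"))  # one colour per box

par(old_par)  # restore the previous background
```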
Drawing Box Plots with R Packages
We perform the same operation with the “lattice” package using bwplot (box-and-whisker plot). We change the title, x-axis label and y-axis label to suit the data, and set pch to 2 to change the plotting symbol. The same plot can be drawn with qplot (from the ggplot2 package) by setting the geom argument to “boxplot”.
Either statement produces a box plot as output.
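A minimal lattice sketch, assuming the iris data from the previous section (the article does not show which data set was plotted here) and placeholder titles. lattice is normally installed with R, but the code checks for it first.

```r
# lattice is a recommended package that usually ships with R.
if (requireNamespace("lattice", quietly = TRUE)) {
  p <- lattice::bwplot(Sepal.Length ~ Species, data = iris,
                       main = "Sepal length by species",
                       xlab = "Species",
                       ylab = "Sepal length (cm)",
                       pch  = 2)  # draws the median marker as a triangle
  print(p)  # lattice plots are drawn when the object is printed
}
```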
We look at the summary of another dataset, “mtcars”, which reports values such as the minimum, first quartile, median, third quartile and maximum. We then compare how the mileage (average) of a car varies with the number of cylinders (cyl and mpg are variables in the data set).
We subsequently analyze the box plot.
The x-axis shows the number of cylinders and the y-axis shows the mileage. The insights derived from the box plot are:
- The fewer the cylinders, the higher the miles per gallon
- 8-cylinder cars have the lowest mileage
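These two observations can be reproduced with base R. The sketch below draws mpg against cyl and computes the median mileage for each cylinder count; the axis labels are placeholders.

```r
# Box plot of mileage for each cylinder count.
boxplot(mpg ~ cyl, data = mtcars,
        xlab = "Number of cylinders",
        ylab = "Miles per gallon")

# Median mileage per cylinder count, named "4", "6" and "8".
median_mpg <- tapply(mtcars$mpg, mtcars$cyl, median)
print(median_mpg)

# Cylinder count of the car with the lowest mileage overall.
lowest_cyl <- mtcars$cyl[which.min(mtcars$mpg)]
print(lowest_cyl)
```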
We then draw the same plot with bwplot and qplot. The qplot command (from the ggplot2 package) is:
qplot(factor(cyl), mpg, data = mtcars, geom = "boxplot")
The insights derived further are:
- The maximum mileage of 8-cylinder cars is almost equal to the median mileage of 6-cylinder cars.
- The maximum mileage of 6-cylinder cars is almost equal to the minimum mileage of 4-cylinder cars.
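A short base R check of these comparisons, independent of any plotting package. The table layout and the names by_cyl, gap_8_vs_6 and gap_6_vs_4 are choices made for this sketch.

```r
# Minimum, median and maximum mileage for each cylinder count.
by_cyl <- sapply(split(mtcars$mpg, mtcars$cyl),
                 function(v) c(min = min(v), median = median(v), max = max(v)))
print(by_cyl)

# How far apart the compared values are (small gaps mean "almost equal").
gap_8_vs_6 <- by_cyl["median", "6"] - by_cyl["max", "8"]
gap_6_vs_4 <- by_cyl["min", "4"] - by_cyl["max", "6"]
print(c(gap_8_vs_6 = gap_8_vs_6, gap_6_vs_4 = gap_6_vs_4))
```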
The describe function from the Hmisc package is run through the following code:
library(Hmisc)
describe(mtcars)
describe(mtcars$mpg)
It will give output containing information such as the number of variables (11), observations (32), unique values and missing values.
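If Hmisc is not installed, some of the same information can be approximated in base R. The sketch below counts variables, observations, missing values and unique values, and calls Hmisc's describe only when the package is available; the variable names are illustrative.

```r
n_vars <- ncol(mtcars)                     # number of variables
n_obs  <- nrow(mtcars)                     # number of observations

n_missing <- colSums(is.na(mtcars))        # missing values per column
n_unique  <- sapply(mtcars, function(col) length(unique(col)))  # unique values per column

print(data.frame(missing = n_missing, unique = n_unique))

# Optional: the richer Hmisc report, only if the package is installed.
if (requireNamespace("Hmisc", quietly = TRUE)) {
  print(Hmisc::describe(mtcars$mpg))
}
```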