
Before we proceed with the analysis of the bank data using R, let me give a quick introduction to R. R is an integrated suite of software facilities for data manipulation, calculation and graphical display. Among other things, it has:

- An effective and fast data handling and storage facility.
- A suite of operators for calculations on arrays, lists, vectors and so on.
- A large, integrated collection of tools for data analysis and visualization.
- Graphical facilities for data analysis and display, either on screen or on paper.
- A well-developed and effective programming language, which is a dialect of ‘S’.
- A wide range of packages that extend and enrich the functionality of R.

We call R an environment because many classical and modern statistical techniques have been integrated into it. About 25 packages are supplied with R, and more than 3000 are available through the Comprehensive R Archive Network (CRAN) family of Internet sites (via http://CRAN.R-project.org) and elsewhere.

R is great software, but it isn’t the right tool for every problem. You should understand your problem and the limitations of R well before you use it. R is very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer’s memory. It’s not as good at storing data in complicated structures, efficiently querying data, or working with data that doesn’t fit in the computer’s memory.

That’s why I used Hadoop to pre-process the large files before moving them into R. You can do the same task in R using regular expressions and binding operations, but that would be quite new and complex for a beginner. Also, R holds datasets in RAM, so you can’t load a large dataset into R unless you have enough memory. To hold large data files, I usually use a database like MySQL or a framework like Hadoop.

**Capability**

There are thousands of statistical and data analysis algorithms in R. None of its counterparts offers the variety of functionality that is available through CRAN.

**Community**

There are millions of R users worldwide, and their number keeps growing because of its capabilities. You can always share your knowledge, doubts or suggestions with them through various forums.

**Performance**

R’s performance compares well with commercial analysis packages. R loads datasets into memory before processing, so the main requirement is a machine with enough memory to use its functionality to the full. Memory is much cheaper today than it was when R was developed, so higher-memory machines are within most users’ reach. That’s probably one of the main reasons why the number of R users is growing at this pace.

The overall process of finding and interpreting patterns in data, known as knowledge discovery in databases (KDD), involves the repeated application of the following steps:

- Loading and developing an understanding of:
  - The application domain
  - The relevant prior knowledge
  - The goals of the end-user
- Creating a target dataset: selecting a dataset, or focusing on a subset of variables or data samples, on which discovery is to be performed.
- Data cleaning and pre-processing:
  - Removal of noise or outliers.
  - Collecting necessary information to model or account for noise.
  - Strategies for handling missing data fields.
  - Accounting for time sequence information and known changes.
- Data reduction and projection:
  - Finding useful features to represent the data depending on the goal of the task.
  - Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
- Choosing the data mining task:
  - Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
- Choosing the data mining algorithm(s):
  - Selecting method(s) to be used for searching for patterns in the data.
  - Deciding which models and parameters may be appropriate.
  - Matching a particular data mining method with the overall criteria of the KDD process.
- Data mining:
  - Searching for patterns of interest in a particular representational form or a set of such representations, such as classification rules or trees, regression, clustering, and so forth.
- Interpreting mined patterns.
- Consolidating discovered knowledge.

**Step 1: Loading and developing an understanding of the data**

We import the Pig output named “combined_out” that we saved in HDFS.

**## setting Hadoop variables for Hadoop in R environment**

Sys.setenv(JAVA_HOME="/home/abhay/java")

Sys.setenv(HADOOP_HOME="/home/abhay/hadoop")

Sys.setenv(HADOOP_CMD="/home/abhay/hadoop/bin/hadoop")

Sys.setenv(HADOOP_STREAMING="/home/abhay/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")

**##loading RHadoop packages**

library(rmr2)

library(rhdfs)

hdfs.init()

**##setting Hadoop root path and reading files from HDFS**

hdfs.root <- '/bank_project'

hdfs.data <- file.path(hdfs.root, 'combined_out/part-r-00000')

content <- hdfs.read.text.file(hdfs.data)

clickpath <- read.table(textConnection(content), sep=",")
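The parsing step can be tried without a Hadoop cluster. The following is a minimal sketch, assuming the HDFS read returns a character vector with one comma-separated record per element; the two records and the name `clickpath_demo` are hypothetical stand-ins, not values from the bank data.

```r
# Hypothetical stand-in for the character vector returned by the HDFS read.
content <- c("1,OWNER,45,M,gold", "2,DISPONENT,38,F,classic")

# Parse the comma-separated lines into a data frame, as the post does.
con <- textConnection(content)
clickpath_demo <- read.table(con, sep = ",", stringsAsFactors = FALSE)
close(con)

# Columns are named V1..V5 until colnames() is assigned.
str(clickpath_demo)
```

Closing the connection explicitly avoids leaving an open text connection behind when the step is repeated in one session.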

**Step 2 : Creating a target dataset**

**##naming all the columns fetched from HDFS**

colnames(clickpath) <- c("ac_id","disposal_type","age","sex","card_type","dist","avg_sal","unemp_rate","entrepreneur_no","trans_sum","loan_amount","loan_status")

**Step 3 : Data cleaning and pre-processing**

**##checking structure of the fetched data**

str(clickpath)

**##list of rows with missing values**

clickpath[!complete.cases(clickpath),]

**##list of columns with missing values**

clickpath[, colSums(is.na(clickpath)) > 0, drop = FALSE]

**## If any missing values are there omit them**

clickpath <- na.omit(clickpath)
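To see what each of these missing-value checks returns, here is a small sketch on a toy data frame. The data frame `demo` and its values are hypothetical; only the base R functions `complete.cases`, `is.na`, `colSums` and `na.omit` are taken as given.

```r
# Toy data frame with hypothetical values; NA marks a missing entry.
demo <- data.frame(
  age     = c(45, NA, 38, 52),
  avg_sal = c(9000, 8500, NA, 10200),
  dist    = c("A", "B", "C", "D")
)

# complete.cases() is row-wise: TRUE for rows without any NA.
rows_with_na <- demo[!complete.cases(demo), ]

# Columns need a column-wise test: count the NAs per column.
cols_with_na <- names(demo)[colSums(is.na(demo)) > 0]

# na.omit() drops every row that contains at least one NA.
demo_clean <- na.omit(demo)
```

Note that `complete.cases` describes rows, so it should not be used to index columns; the per-column NA count does that job.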

**Step 4 : Data reduction and projection**

**##selecting only numerical data and removing ac_id column**

mydata <- clickpath[,c(3,7:11)]

**# First check the complete set of components for outliers**
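The plotting call for this check is not shown in the post. As a rough sketch of one way to flag outliers, assuming the usual boxplot rule (points more than 1.5 times the interquartile range beyond the hinges), base R's `boxplot.stats` reports the same points that `boxplot` would draw beyond the whiskers. The vector `x` is hypothetical.

```r
# Hypothetical values; 60 sits far above the rest.
x <- c(10, 12, 11, 13, 12, 14, 11, 60)

# boxplot.stats() applies the 1.5 * IQR whisker rule used by boxplot().
stats <- boxplot.stats(x)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # points beyond the whiskers, i.e. the flagged outliers
```

Calling `boxplot(x)` would show the same information graphically.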

**## As we can see from the above plots, avg_sal, unemp_rate and loan_amount have some outliers. Let’s analyze these three individually.**

**# outlier in avg_sal**

**# avg_sal is one of the most useful variables, so we can’t simply ignore its outliers without investigation. Hence we check the scatterplot of this variable for more clarity.**

plot(mydata[,c(2)])

**## From this plot we observe that many entries fall in the outlier range. Hence, it would not be a good idea to remove these outliers.**

**#outlier in unemp_rate**

**## From this graph we can see that a few entries lie as outliers. Since this variable is less important, we can reduce its outliers. There are many ways to handle outliers; I chose to replace them with the maximum non-outlier value, which is ~1.5.**

**## defining function to replace outliers**

library(data.table)

outlierReplace <- function(dataframe, cols, rows, newValue = NA) { if (length(rows) > 0) { set(dataframe, rows, cols, newValue) } }

**# calling the outlierReplace function on unemp_rate in mydata, the data that is checked and clustered below, to replace all outliers with the maximum non-outlier value**

outlierReplace(mydata, "unemp_rate", which(mydata$unemp_rate > 1.5), 1.5)

**## now checking the five-number summary of unemp_rate to verify that the outliers have been replaced**

fivenum(mydata$unemp_rate)
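Capping values at a threshold can also be done in base R without the data.table helper. This is only a sketch of that alternative, not the method used above: the vector `unemp_rate` holds hypothetical values and `cap` mirrors the ~1.5 threshold mentioned in the post.

```r
# Hypothetical unemployment rates; two of them exceed the cap.
unemp_rate <- c(0.4, 0.9, 1.2, 2.8, 3.6)
cap <- 1.5

# pmin() takes the element-wise minimum, so values above the cap become the cap.
capped <- pmin(unemp_rate, cap)

# The maximum of the five-number summary should now equal the cap.
fivenum(capped)
```

Unlike `set`, `pmin` returns a new vector, so the result has to be assigned back to the column it came from.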

**#outlier in loan_amount**

**## From this plot we observe that many entries fall in the outlier range. Considering the sensitivity of this variable, it would not be a good idea to remove these outliers.**

**## Since the attributes are measured on different scales, we standardize the columns so that they are comparable.**

mydata <- scale(mydata)
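For reference, `scale` with its default arguments centers each column on its mean and divides by its standard deviation. A small sketch with hypothetical columns shows the effect:

```r
# Two hypothetical columns on very different scales.
demo <- data.frame(
  age     = c(25, 35, 45, 55),
  avg_sal = c(8000, 9000, 10000, 11000)
)

# scale() returns a numeric matrix with standardized columns.
scaled <- scale(demo)

colMeans(scaled)      # each column mean is (numerically) zero
apply(scaled, 2, sd)  # each column standard deviation is one
```

Without this step, a column with large values such as a salary would dominate the Euclidean distances that k-means relies on.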

**Step 5 : Choosing the data mining task**

**## Calculating the sum of squares for a single cluster from the column variances and storing it at the first index of wss**

wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
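The expression works because the sample variance of a column is its sum of squared deviations divided by n - 1, so multiplying the summed variances by n - 1 gives the total sum of squares, which is the within-cluster sum of squares when all points form one cluster. A quick check on hypothetical random data:

```r
set.seed(1)
x <- matrix(rnorm(40), ncol = 2)  # hypothetical data: 20 points, 2 variables

# Formula used in the post for the one-cluster case.
wss1 <- (nrow(x) - 1) * sum(apply(x, 2, var))

# Direct computation: squared deviations from the column means.
total_ss <- sum(scale(x, scale = FALSE)^2)

all.equal(wss1, total_ss)

# kmeans() reports the same quantity as totss, whatever k is used.
all.equal(wss1, kmeans(x, centers = 2)$totss)
```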

**Step 6 : Choosing the data mining algorithm(s)**

**##We are going to use k-means algorithm for this clustering.**

What is cluster analysis?

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure.

K-Means Algorithm Properties

- There are always K clusters.
- There is always at least one item in each cluster.
- The clusters are non-hierarchical and they do not overlap.
- Every member of a cluster is closer to the centre (centroid) of its own cluster than to the centre of any other cluster.

The K-Means Algorithm Process

- The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters resulting in clusters that have roughly the same number of data points.
- For each data point: calculate the distance from the data point to the centre (centroid) of each cluster.
- If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.
- Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

The choice of initial partition can greatly affect the final clusters that result in terms of inter-cluster and intra-cluster distances and cohesion.
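The process described above can be written down in a few lines of R. This is an illustrative sketch only, not the implementation inside `kmeans` (which defaults to the Hartigan-Wong algorithm); the function name `simple_kmeans` and the synthetic data are hypothetical, and the sketch assumes at least two columns and that no cluster becomes empty.

```r
# Minimal k-means sketch; assumes >= 2 columns and no empty clusters.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  # Random initial partition with roughly equal cluster sizes.
  cluster <- sample(rep_len(seq_len(k), nrow(x)))
  for (iter in seq_len(max_iter)) {
    # Centroid of each cluster: column means of its members (k x p matrix).
    centers <- t(sapply(seq_len(k), function(j) {
      colMeans(x[cluster == j, , drop = FALSE])
    }))
    # Squared distance from every point to every centroid (n x k matrix).
    d <- sapply(seq_len(k), function(j) {
      rowSums(sweep(x, 2, centers[j, ])^2)
    })
    # Move each point to its closest centroid.
    new_cluster <- max.col(-d, ties.method = "first")
    # Stop when a full pass moves no point.
    if (all(new_cluster == cluster)) break
    cluster <- new_cluster
  }
  list(cluster = cluster, centers = centers)
}

# Usage on two hypothetical, well-separated groups of 20 points each.
set.seed(42)
x <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 10), ncol = 2))
res <- simple_kmeans(x, k = 2)
table(res$cluster)
```

Because the starting partition is random, different seeds can lead to different final clusters, which is why `kmeans` offers the `nstart` argument to try several starts and keep the best one.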

**## for each k from 2 to 15, run k-means with k clusters and store the total within-cluster sum of squares in wss[k]**

for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)

**## plot wss against the number of clusters to display the elbow graph**

plot(1:15, wss, type="b", main="15 clusters", xlab="no. of clusters", ylab="within-cluster sum of squares")
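To see what an elbow looks like when the true number of groups is known, here is a sketch on synthetic data with three well-separated groups. The data, the range 1 to 8 and the use of `nstart = 10` are assumptions made for this illustration, not settings from the bank analysis.

```r
set.seed(123)
# Three hypothetical, well-separated groups of 50 points in two dimensions.
x <- rbind(
  matrix(rnorm(100, mean = 0),  ncol = 2),
  matrix(rnorm(100, mean = 6),  ncol = 2),
  matrix(rnorm(100, mean = 12), ncol = 2)
)

# Total within-cluster sum of squares for k = 1..8; nstart = 10 keeps the
# best of ten random starts for each k.
wss_demo <- sapply(1:8, function(k) {
  kmeans(x, centers = k, nstart = 10)$tot.withinss
})

plot(1:8, wss_demo, type = "b",
     xlab = "Number of clusters",
     ylab = "Within-cluster sum of squares")
```

The curve drops steeply up to k = 3 and flattens afterwards, which is the bend that the elbow heuristic looks for.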

**Step 7 : Searching for patterns of interest in a particular representational form**

**## As we can see from the above plot, the slope of the curve changes most sharply at 3 clusters, hence we take 3 as the optimal number of clusters**

fit <- kmeans(mydata,3)

**## Let’s check the summary of the kmeans object**

kmeans returns an object of class "kmeans", which has a print and a fitted method. It is a list with at least the following components:

| Component | Description |
| --- | --- |
| `cluster` | A vector of integers (from 1:k) indicating the cluster to which each point is allocated. |
| `centers` | A matrix of cluster centres. |
| `totss` | The total sum of squares. |
| `withinss` | Vector of within-cluster sum of squares, one component per cluster. |
| `tot.withinss` | Total within-cluster sum of squares, i.e. `sum(withinss)`. |
| `betweenss` | The between-cluster sum of squares, i.e. `totss-tot.withinss`. |
| `size` | The number of points in each cluster. |
| `iter` | The number of (outer) iterations. |
| `ifault` | integer: indicator of a possible algorithm problem – for experts. |

**## checking withinss, i.e. the within-cluster sum of squares for each cluster (smaller values mean tighter clusters)**

fit$withinss

**## checking betweenss, i.e. the between-cluster sum of squares (larger values mean better-separated clusters)**

fit$betweenss

fit$size
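These components are tied together by a simple identity: the total sum of squares splits into the within-cluster and between-cluster parts. A sketch on hypothetical random data (the object name `fit_demo` is made up for this example):

```r
set.seed(7)
x <- matrix(rnorm(120), ncol = 3)  # hypothetical data: 40 points, 3 variables
fit_demo <- kmeans(x, centers = 3, nstart = 5)

# totss = tot.withinss + betweenss
all.equal(fit_demo$totss, fit_demo$tot.withinss + fit_demo$betweenss)

# The cluster sizes add up to the number of points.
sum(fit_demo$size) == nrow(x)

# Share of the total variation explained by the clustering (between 0 and 1).
fit_demo$betweenss / fit_demo$totss
```

The ratio on the last line is a convenient single number for comparing fits: the closer it is to 1, the more of the variation is captured by the cluster structure.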

**Step 8 : Interpreting mined patterns**

plot(mydata, col=fit$cluster, pch=15)

points(fit$centers, col=1:3, pch=3)

library(cluster)

library(fpc)

plotcluster(mydata, fit$cluster)

points(fit$centers, col=1:8, pch=16)

clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)

**## checking the mean of each variable in each cluster**

**Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are.**

mydata <- clickpath[,c(3,7:12)]

mydata <- data.frame(mydata, fit$cluster)

cluster_mean <- aggregate(mydata[,1:8], by = list(fit$cluster), FUN = mean)

cluster_mean
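The shape of the `aggregate` result is easier to read on a tiny example. In this sketch the data frame `demo` and the labels in `cluster` are hypothetical; in the analysis above the labels come from `fit$cluster`.

```r
# Hypothetical data with two obvious groups.
demo <- data.frame(
  age         = c(25, 30, 50, 55),
  loan_amount = c(100, 120, 400, 420)
)
cluster <- c(1, 1, 2, 2)  # hypothetical cluster labels

# One row per cluster, one column per variable, each cell a cluster mean.
cluster_mean_demo <- aggregate(demo, by = list(cluster = cluster), FUN = mean)
cluster_mean_demo
```

Naming the grouping list element (`cluster = ...`) gives the first column a readable name instead of the default `Group.1`.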

K-means clustering, in particular when using heuristics such as Lloyd’s algorithm, is rather easy to implement and apply even on large datasets. It has been successfully used in various fields, including market segmentation, computer vision, geostatistics, astronomy and agriculture. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.

**In the next blog post we will learn about the application of logistic regression and market basket analysis to bank data.**

*Got a question for us? Please mention it in the comments section and we will get back to you.*

