Published on Jun 20,2018

Before we proceed with the analysis of the bank data using R, let me give a quick introduction to R. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

The core features of R include:

  • Effective and fast data handling and storage facilities.
  • A suite of operators for calculations on arrays, lists, vectors etc.
  • A large, integrated collection of tools for data analysis and visualization.
  • Graphical facilities for data analysis and display, either on-screen or on paper.
  • A well-implemented and effective programming language, ‘S’, on top of which R is built.
  • A complete range of packages to extend and enrich the functionality of R.

We call R an environment within which many classical and modern statistical techniques have been integrated. About 25 packages are supplied with R, and more than 3,000 are available through the Comprehensive R Archive Network (CRAN) family of Internet sites (via http://CRAN.R-project.org) and elsewhere.

Note: Going ahead, when I use the word ‘R’, I mean the complete R environment.

When Should You Use R?

Though R is great software, it isn’t the right tool for every problem. You should know your problem and the limitations of R well before you use it. R is very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer’s memory. It’s not as good at storing data in complicated structures, efficiently querying data, or working with data that doesn’t fit in the computer’s memory.

That’s the reason I used Hadoop to pre-process large files before moving them into R. You could do the same task in R using regular expressions and binding operations, but it would be quite new and complex for a beginner. Also, R holds datasets in RAM, so you can’t work with large datasets unless you have a correspondingly large amount of memory. To hold large data files, I usually use a database like MySQL, or a framework like Hadoop.
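As a rough guide, you can check how much memory an object occupies once it is loaded. A minimal sketch (the data frame below is an illustrative stand-in, not the bank data used later):

```r
# Build a stand-in data frame of one million rows and check its footprint.
df <- data.frame(id = 1:1e6, value = rnorm(1e6))
print(object.size(df), units = "MB")

# Free the memory again once you are done with the object.
rm(df)
invisible(gc())
```

If the reported size approaches your available RAM, pre-filter the file in Hadoop (or a database) before importing it into R.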

How is R Better Than its Commercial Counterparts?

Capability

There are thousands of statistical and data-analysis algorithms in R. None of its counterparts offers the variety of functionality that is available through CRAN.

Community

There are millions of R users worldwide, and the number is growing rapidly thanks to R’s capabilities. You can always share your knowledge, doubts or suggestions with them through various forums.

Performance

R’s performance is excellent compared to other commercial analysis packages. R loads datasets into memory before processing, so the only thing you need is a machine with a good configuration to use its functionality to the maximum extent. Higher-memory machines are within everyone’s reach now, as memory is much cheaper today than when R was developed. That’s probably one of the biggest reasons why the R user base is growing at this pace.

Data Mining Steps:

The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:

  1. Loading and developing an understanding of:
    • The application domain
    • The relevant prior knowledge
    • The goals of the end-user
  2. Creating a target dataset: Selecting a dataset, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
  3. Data cleaning and pre-processing.
    • Removal of noise or outliers.
    • Collecting necessary information to model or account for noise.
    • Strategies for handling missing data fields.
    • Accounting for time sequence information and known changes.
  4. Data reduction and projection.
    • Finding useful features to represent the data depending on the goal of the task.
    • Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
  5. Choosing the data mining task.
    • Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
  6. Choosing the data mining algorithm(s).
    • Selecting method(s) to be used for searching for patterns in the data.
    • Deciding which models and parameters may be appropriate.
    • Matching a particular data mining method with the overall criteria of the KDD process.
  7. Data mining.
    • Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth.
  8. Interpreting mined patterns.
  9. Consolidating discovered knowledge.

Step 1: Loading and developing an understanding of the data

We import the file named “combined_out” that we saved to HDFS from Pig.

## setting Hadoop environment variables in the R environment

Sys.setenv(JAVA_HOME="/home/abhay/java")
Sys.setenv(HADOOP_HOME="/home/abhay/hadoop")
Sys.setenv(HADOOP_CMD="/home/abhay/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/abhay/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")

##loading RHadoop packages

library(rmr2)
library(rhdfs)
hdfs.init()

##setting Hadoop root path and reading files from HDFS

hdfs.root <- '/bank_project'
hdfs.data <- file.path(hdfs.root, 'combined_out/part-r-00000')
content <- hdfs.read.text.file(hdfs.data)
clickpath <- read.table(textConnection(content), sep = ",")

Step 2 : Creating a target dataset

##naming all the columns fetched from HDFS

colnames(clickpath) <- c("ac_id","disposal_type","age","sex","card_type","dist","avg_sal","unemp_rate","entrepreneur_no","trans_sum","loan_amount","loan_status")

Step 3 : Data cleaning and pre-processing

##checking structure of the fetched data

str(clickpath)

summary(clickpath)

##list of rows with missing values

clickpath[!complete.cases(clickpath),]

##list of columns with missing values (complete.cases() works row-wise, so we check NA counts per column instead)

clickpath[, colSums(is.na(clickpath)) > 0]

## If any missing values are there, omit those rows

clickpath <- na.omit(clickpath)

Step 4 : Data reduction and projection

##selecting only the numerical columns (dropping ac_id and the categorical columns)

mydata <- clickpath[,c(3,7:11)]

# First check the complete set of components for outliers

boxplot(mydata)

## As we can see from the plots above, avg_sal, unemp_rate and loan_amount have some outliers in the data. Let’s analyze all three individually.

# outlier in avg_sal

boxplot(mydata[,c(2)])

# Since avg_sal is one of the most useful attributes, we can’t simply drop values without investigation. Hence we check the scatterplot of this column for more clarity.

plot(mydata[,c(2)])


## From this plot we observe that a large number of entries fall in the outlier range. Hence, it would not be a good idea to remove these outliers.

#outlier in unemp_rate

boxplot(mydata[,c(3)])

## From this graph we can see that a few entries lie outside the whiskers. Since this column is not that critical, we can decide to reduce its outliers. There are many ways of removing outliers; I chose to replace them with the maximum non-outlier value, which is ~1.5.

## defining function to replace outliers

library(data.table)
outlierReplace <- function(dataframe, cols, rows, newValue = NA) {
  if (length(rows) > 0) {
    set(dataframe, rows, cols, newValue)  # data.table::set updates by reference
  }
}

# calling the outlierReplace function on the unemp_rate column of mydata (the frame we analyze below) to replace all the outliers with the maximum non-outlier value

outlierReplace(mydata, "unemp_rate", which(mydata$unemp_rate > 1.5), 1.5)

## now checking the five-number summary of the column to verify that the outliers have been replaced

fivenum(mydata$unemp_rate)

#outlier in loan_amount

boxplot(mydata[,c(6)])

plot(mydata[,c(6)])

## From this plot we observe that a large number of entries fall in the outlier range. Considering the sensitivity of this column, it would not be a good idea to remove these outliers.

## Since the data attributes are of different varieties, their scales are also different. To bring them onto a uniform scale, we standardize all the columns.

mydata <- scale(mydata)

Step 5 : Choosing the data mining task

## Calculating variance and storing at the first index in wss

wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
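The value stored in wss[1] is the total within-cluster sum of squares for k = 1, i.e. the total sum of squares about the column means. A quick sanity check of that identity, sketched on a built-in dataset standing in for mydata:

```r
# With a single cluster, kmeans' tot.withinss equals (n - 1) * sum of column variances.
x <- scale(iris[, 1:4])  # stand-in for mydata
wss1 <- (nrow(x) - 1) * sum(apply(x, 2, var))
all.equal(wss1, kmeans(x, centers = 1)$tot.withinss)
```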

Step 6 : Choosing the data mining algorithm(s)

## We are going to use the k-means algorithm for this clustering.

What is Cluster Analysis?

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error.

K-Means Algorithm Properties

There are always K clusters.

There is always at least one item in each cluster.

The clusters are non-hierarchical and they do not overlap.

Every member of a cluster is closer to its own cluster center than to the center of any other cluster.

The K-Means Algorithm Process

  • The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters resulting in clusters that have roughly the same number of data points.
  • For each data point: Calculate the distance from the data point to each cluster.
  • If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.
  • Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

The choice of initial partition can greatly affect the final clusters that result in terms of inter-cluster and intra-cluster distances and cohesion.
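The process described above can be sketched directly in R. This is an illustrative toy version only (in practice you would use the built-in kmeans(); this sketch also simply keeps an empty cluster's old center rather than handling that case properly):

```r
# A minimal Lloyd-style k-means: assign points to the nearest center,
# recompute centers, and repeat until no point changes cluster.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  cluster_id <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # distance from every point (rows after the first k) to every center
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_id <- max.col(-d)                           # index of nearest center
    if (all(new_id == cluster_id)) break            # stable: clustering ends
    cluster_id <- new_id
    for (j in seq_len(k)) {                         # recompute each center
      pts <- x[cluster_id == j, , drop = FALSE]
      if (nrow(pts) > 0) centers[j, ] <- colMeans(pts)
    }
  }
  list(cluster = cluster_id, centers = centers)
}

set.seed(42)
res <- simple_kmeans(iris[, 1:4], k = 3)
table(res$cluster)
```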

## for each k from 2 to 15, run kmeans and store the total within-cluster sum of squares in wss

for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)

## plot wss against the number of clusters to display the elbow graph

plot(1:15, wss, type = "b", main = "15 clusters", xlab = "no. of clusters", ylab = "within-cluster sum of squares")


Step 7 : Searching for patterns of interest in a particular representational form

## As we can see from the above output, the slope of the graph changes most sharply at 3 clusters. Hence we take 3 as the optimal number of clusters, at which we can get the optimum result.

fit <- kmeans(mydata,3)

## Let’s check the summary of the kmeans object

fit

kmeans returns an object of class "kmeans", which has print and fitted methods. It is a list with at least the following components:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centres.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
  • size: The number of points in each cluster.
  • iter: The number of (outer) iterations.
  • ifault: Integer indicator of a possible algorithm problem – for experts.
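These components, and the identities among them, can be checked on a small built-in dataset (a sketch; iris stands in for our bank data):

```r
# Fit kmeans on a toy dataset and inspect the main components of the result.
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 10)

km$size        # number of points in each cluster
km$withinss    # within-cluster sum of squares, one value per cluster
km$centers     # one row of (scaled) feature means per cluster

# The decomposition of the total sum of squares holds by definition:
all.equal(km$tot.withinss, sum(km$withinss))
all.equal(km$totss, km$tot.withinss + km$betweenss)
```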

## checking withinss, i.e. the intra-cluster cohesion (sum of squares) for each cluster

fit$withinss

## checking betweenss, i.e. the inter-cluster separation, and the size of each cluster

fit$betweenss
fit$size

Step 8 :  Interpreting mined patterns

plot(mydata, col = fit$cluster, pch = 15)
points(fit$centers, col = 1:8, pch = 3)


library(cluster)
library(fpc)
plotcluster(mydata, fit$cluster)
points(fit$centers, col = 1:8, pch = 16)


clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)


## checking mean for each object in each cluster

Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are.

mydata <- clickpath[,c(3,7:12)]
mydata <- data.frame(mydata, fit$cluster)
cluster_mean <- aggregate(mydata[,1:8], by = list(fit$cluster), FUN = mean)
cluster_mean


K-means clustering, in particular when using heuristics such as Lloyd’s algorithm, is rather easy to implement and apply even on large datasets. It has been successfully used in various fields, including market segmentation, computer vision, geostatistics, astronomy and agriculture. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.
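One way to use it as a preprocessing step, sketched on a built-in dataset: run a quick, rough clustering, then feed its centers into a second run as the starting configuration (the dataset and names here are illustrative):

```r
set.seed(7)
x <- scale(iris[, 1:4])

# A quick, rough pass with very few iterations (may warn that it did not converge).
rough <- kmeans(x, centers = 3, iter.max = 2)

# Refine the solution, starting from the rough centers instead of random ones.
refined <- kmeans(x, centers = rough$centers)

# Refinement can only tighten the clusters, never loosen them.
refined$tot.withinss <= rough$tot.withinss
```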

In the next blog we will learn how to apply logistic regression and market basket analysis to the bank data.

Got a question for us? Please mention it in the comments section and we will get back to you.



About Author
Abhay Kumar
I enjoy being challenged and working on projects that require me to work outside my comfort and knowledge set, as continuing to learn new languages and development techniques is important to me and to the success of my organization. My technical expertise includes cross-platform proficiency (Windows, Unix, Linux); fluency in scripting/programming languages (including Java/J2EE, R, SQL, jQuery, HTML); advanced knowledge of developer applications, tools, methodologies and best practices (including OOD, client/server architecture, self-test automation, web scraping, JSON processing, AJAX); and extensive knowledge of analytics platforms/tools (including Hadoop, Pig, MapReduce, HBase, R, OLAP cubes, data warehousing concepts, and reporting libraries like JFreeChart (Java) and Highcharts (JS)).
