Published on Jun 20,2018

Before we proceed with the analysis of the bank data using R, let me give a quick introduction to R. R is an integrated suite of software facilities for data manipulation, calculation and graphical display.

The core features of R include:

  • Effective and fast data handling and storage facilities.
  • A suite of operators for calculations on arrays, lists, vectors etc.
  • A large, integrated collection of tools for data analysis and visualization.
  • Graphical facilities for data analysis and display, either on-screen or on paper.
  • A well-implemented and effective programming language, ‘S’, on top of which R is built.
  • A complete range of packages to extend and enrich the functionality of R.

We call R an environment within which many classical and modern statistical techniques have been integrated. About 25 packages are supplied with R, and more than 3,000 are available through the Comprehensive R Archive Network (CRAN) family of Internet sites (via http://CRAN.R-project.org) and elsewhere.

Note: Going ahead, when I use the word ‘R’, I mean the complete R environment.

When Should You Use R?

Though R is great software, it isn’t the right tool for every problem. You should know your problem and the limitations of R well before you use it. R is very good at plotting graphics, analyzing data, and fitting statistical models using data that fits in the computer’s memory. It’s not as good at storing data in complicated structures, efficiently querying data, or working with data that doesn’t fit in the computer’s memory.

That’s the reason I used Hadoop to pre-process large files before moving them into R. You could do the same task in R using regular expressions and binding operations, but it would be quite new and complex for a beginner. Also, R holds datasets in RAM, so you can’t work with large datasets unless you have a correspondingly large amount of memory. To hold large data files, I usually use a database like MySQL, or a framework like Hadoop.
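As a rough guide, you can check how much memory an object occupies once it is loaded. A minimal sketch (the data frame below is an illustrative stand-in, not the bank data used later):

```r
# Build a stand-in data frame of one million rows and check its footprint.
df <- data.frame(id = 1:1e6, value = rnorm(1e6))
print(object.size(df), units = "MB")

# Free the memory again once you are done with the object.
rm(df)
invisible(gc())
```

If the reported size approaches your available RAM, pre-filter the file in Hadoop (or a database) before importing it into R.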

How is R Better Than its Commercial Counterparts?

Capability

There are thousands of statistical and data-analysis algorithms in R. None of its counterparts offers the variety of functionality that is available through CRAN.

Community

There are millions of R users worldwide, and the number is growing rapidly thanks to R’s capabilities. You can always share your knowledge, doubts or suggestions with them through various forums.

Performance

R’s performance is excellent compared to other commercial analysis packages. R loads datasets into memory before processing, so the only thing you need is a machine with a good configuration to use its functionality to the maximum extent. Higher-memory machines are within everyone’s reach now, as memory is much cheaper today than when R was developed. That’s probably one of the biggest reasons why the R user base is growing at this pace.

Data Mining Steps:

The overall process of finding and interpreting patterns from data involves the repeated application of the following steps:

  1. Loading and developing an understanding of:
    • The application domain
    • The relevant prior knowledge
    • The goals of the end-user
  2. Creating a target dataset: Selecting a dataset, or focusing on a subset of variables, or data samples, on which discovery is to be performed.
  3. Data cleaning and pre-processing.
    • Removal of noise or outliers.
    • Collecting necessary information to model or account for noise.
    • Strategies for handling missing data fields.
    • Accounting for time sequence information and known changes.
  4. Data reduction and projection.
    • Finding useful features to represent the data depending on the goal of the task.
    • Using dimensionality reduction or transformation methods to reduce the effective number of variables under consideration or to find invariant representations for the data.
  5. Choosing the data mining task.
    • Deciding whether the goal of the KDD process is classification, regression, clustering, etc.
  6. Choosing the data mining algorithm(s).
    • Selecting method(s) to be used for searching for patterns in the data.
    • Deciding which models and parameters may be appropriate.
    • Matching a particular data mining method with the overall criteria of the KDD process.
  7. Data mining.
    • Searching for patterns of interest in a particular representational form or a set of such representations as classification rules or trees, regression, clustering, and so forth.
  8. Interpreting mined patterns.
  9. Consolidating discovered knowledge.

Step 1: Loading and developing an understanding of the data

We import the file named “combined_out” that we saved to HDFS from Pig.

## setting Hadoop environment variables in the R environment

Sys.setenv(JAVA_HOME="/home/abhay/java")
Sys.setenv(HADOOP_HOME="/home/abhay/hadoop")
Sys.setenv(HADOOP_CMD="/home/abhay/hadoop/bin/hadoop")
Sys.setenv(HADOOP_STREAMING="/home/abhay/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.2.0.jar")

##loading RHadoop packages

library(rmr2)
library(rhdfs)
hdfs.init()

##setting Hadoop root path and reading files from HDFS

hdfs.root <- '/bank_project'
hdfs.data <- file.path(hdfs.root, 'combined_out/part-r-00000')
content <- hdfs.read.text.file(hdfs.data)
clickpath <- read.table(textConnection(content), sep = ",")

Step 2 : Creating a target dataset

##naming all the columns fetched from HDFS

colnames(clickpath) <- c("ac_id","disposal_type","age","sex","card_type","dist","avg_sal","unemp_rate","entrepreneur_no","trans_sum","loan_amount","loan_status")

Step 3 : Data cleaning and pre-processing

##checking structure of the fetched data

str(clickpath)

summary(clickpath)

##list of rows with missing values

clickpath[!complete.cases(clickpath),]

##list of columns with missing values (complete.cases() works row-wise, so we check NA counts per column instead)

clickpath[, colSums(is.na(clickpath)) > 0]

## If any missing values are there, omit those rows

clickpath <- na.omit(clickpath)

Step 4 : Data reduction and projection

##selecting only the numerical columns (dropping ac_id and the categorical columns)

mydata <- clickpath[,c(3,7:11)]

# First check the complete set of components for outliers

boxplot(mydata)

## As we can see from the plots above, avg_sal, unemp_rate and loan_amount have some outliers in the data. Let’s analyze all three individually.

# outlier in avg_sal

boxplot(mydata[,c(2)])

# Since avg_sal is one of the most useful attributes, we can’t simply drop values without investigation. Hence we check the scatterplot of this column for more clarity.

plot(mydata[,c(2)])


## From this plot we observe that a large number of entries fall in the outlier range. Hence, it would not be a good idea to remove these outliers.

#outlier in unemp_rate

boxplot(mydata[,c(3)])

## From this graph we can see that a few entries lie outside the whiskers. Since this column is not that critical, we can decide to reduce its outliers. There are many ways of removing outliers; I chose to replace them with the maximum non-outlier value, which is ~1.5.

## defining function to replace outliers

library(data.table)
outlierReplace <- function(dataframe, cols, rows, newValue = NA) {
  if (length(rows) > 0) {
    set(dataframe, rows, cols, newValue)  # data.table::set updates by reference
  }
}

# calling the outlierReplace function on the unemp_rate column of mydata (the frame we analyze below) to replace all the outliers with the maximum non-outlier value

outlierReplace(mydata, "unemp_rate", which(mydata$unemp_rate > 1.5), 1.5)

## now checking the five-number summary of the column to verify that the outliers have been replaced

fivenum(mydata$unemp_rate)

#outlier in loan_amount

boxplot(mydata[,c(6)])

plot(mydata[,c(6)])

## From this plot we observe that a large number of entries fall in the outlier range. Considering the sensitivity of this column, it would not be a good idea to remove these outliers.

## Since the data attributes are of different varieties, their scales are also different. To bring them onto a uniform scale, we standardize all the columns.

mydata <- scale(mydata)

Step 5 : Choosing the data mining task

## Calculating variance and storing at the first index in wss

wss <- (nrow(mydata)-1)*sum(apply(mydata,2,var))
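The value stored in wss[1] is the total within-cluster sum of squares for k = 1, i.e. the total sum of squares about the column means. A quick sanity check of that identity, sketched on a built-in dataset standing in for mydata:

```r
# With a single cluster, kmeans' tot.withinss equals (n - 1) * sum of column variances.
x <- scale(iris[, 1:4])  # stand-in for mydata
wss1 <- (nrow(x) - 1) * sum(apply(x, 2, var))
all.equal(wss1, kmeans(x, centers = 1)$tot.withinss)
```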

Step 6 : Choosing the data mining algorithm(s)

## We are going to use the k-means algorithm for this clustering.

What is Cluster Analysis?

Cluster analysis, or clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense or another) to each other than to those in other groups (clusters). It is the main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics.

Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and error.

K-Means Algorithm Properties

There are always K clusters.

There is always at least one item in each cluster.

The clusters are non-hierarchical and they do not overlap.

Every member of a cluster is closer to its own cluster center than to the center of any other cluster.

The K-Means Algorithm Process

  • The dataset is partitioned into K clusters and the data points are randomly assigned to the clusters resulting in clusters that have roughly the same number of data points.
  • For each data point: Calculate the distance from the data point to each cluster.
  • If the data point is closest to its own cluster, leave it where it is. If the data point is not closest to its own cluster, move it into the closest cluster.
  • Repeat the above step until a complete pass through all the data points results in no data point moving from one cluster to another. At this point the clusters are stable and the clustering process ends.

The choice of initial partition can greatly affect the final clusters that result in terms of inter-cluster and intra-cluster distances and cohesion.
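The process described above can be sketched directly in R. This is an illustrative toy version only (in practice you would use the built-in kmeans(); this sketch also simply keeps an empty cluster's old center rather than handling that case properly):

```r
# A minimal Lloyd-style k-means: assign points to the nearest center,
# recompute centers, and repeat until no point changes cluster.
simple_kmeans <- function(x, k, max_iter = 100) {
  x <- as.matrix(x)
  centers <- x[sample(nrow(x), k), , drop = FALSE]  # random initial centers
  cluster_id <- rep(0L, nrow(x))
  for (iter in seq_len(max_iter)) {
    # distance from every point (rows after the first k) to every center
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k]
    new_id <- max.col(-d)                           # index of nearest center
    if (all(new_id == cluster_id)) break            # stable: clustering ends
    cluster_id <- new_id
    for (j in seq_len(k)) {                         # recompute each center
      pts <- x[cluster_id == j, , drop = FALSE]
      if (nrow(pts) > 0) centers[j, ] <- colMeans(pts)
    }
  }
  list(cluster = cluster_id, centers = centers)
}

set.seed(42)
res <- simple_kmeans(iris[, 1:4], k = 3)
table(res$cluster)
```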

## for each k from 2 to 15, run kmeans and store the total within-cluster sum of squares in wss

for (i in 2:15) wss[i] <- sum(kmeans(mydata, centers = i, iter.max = 15)$withinss)

## plot wss against the number of clusters to display the elbow graph

plot(1:15, wss, type = "b", main = "15 clusters", xlab = "no. of clusters", ylab = "within-cluster sum of squares")


Step 7 : Searching for patterns of interest in a particular representational form

## As we can see from the above output, the slope of the graph changes most sharply at 3 clusters. Hence we take 3 as the optimal number of clusters, at which we can get the optimum result.

fit <- kmeans(mydata,3)

## Let’s check the summary of the kmeans object

fit

kmeans returns an object of class "kmeans", which has print and fitted methods. It is a list with at least the following components:

  • cluster: A vector of integers (from 1:k) indicating the cluster to which each point is allocated.
  • centers: A matrix of cluster centres.
  • totss: The total sum of squares.
  • withinss: Vector of within-cluster sum of squares, one component per cluster.
  • tot.withinss: Total within-cluster sum of squares, i.e. sum(withinss).
  • betweenss: The between-cluster sum of squares, i.e. totss - tot.withinss.
  • size: The number of points in each cluster.
  • iter: The number of (outer) iterations.
  • ifault: Integer indicator of a possible algorithm problem – for experts.
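These components, and the identities among them, can be checked on a small built-in dataset (a sketch; iris stands in for our bank data):

```r
# Fit kmeans on a toy dataset and inspect the main components of the result.
km <- kmeans(scale(iris[, 1:4]), centers = 3, nstart = 10)

km$size        # number of points in each cluster
km$withinss    # within-cluster sum of squares, one value per cluster
km$centers     # one row of (scaled) feature means per cluster

# The decomposition of the total sum of squares holds by definition:
all.equal(km$tot.withinss, sum(km$withinss))
all.equal(km$totss, km$tot.withinss + km$betweenss)
```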

## checking withinss, i.e. the intra-cluster cohesion (sum of squares) for each cluster

fit$withinss

## checking betweenss, i.e. the inter-cluster separation, and the size of each cluster

fit$betweenss
fit$size

Step 8 :  Interpreting mined patterns

plot(mydata, col = fit$cluster, pch = 15)
points(fit$centers, col = 1:8, pch = 3)


library(cluster)
library(fpc)
plotcluster(mydata, fit$cluster)
points(fit$centers, col = 1:8, pch = 16)


clusplot(mydata, fit$cluster, color=TRUE, shade=TRUE, labels=2, lines=0)


## checking mean for each object in each cluster

Usually, as the result of a k-means clustering analysis, we would examine the means for each cluster on each dimension to assess how distinct our k clusters are.

mydata <- clickpath[,c(3,7:12)]
mydata <- data.frame(mydata, fit$cluster)
cluster_mean <- aggregate(mydata[,1:8], by = list(fit$cluster), FUN = mean)
cluster_mean


K-means clustering, in particular when using heuristics such as Lloyd’s algorithm, is rather easy to implement and apply even on large datasets. It has been successfully used in various fields, including market segmentation, computer vision, geostatistics, astronomy and agriculture. It is often used as a preprocessing step for other algorithms, for example to find a starting configuration.
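One way to use it as a preprocessing step, sketched on a built-in dataset: run a quick, rough clustering, then feed its centers into a second run as the starting configuration (the dataset and names here are illustrative):

```r
set.seed(7)
x <- scale(iris[, 1:4])

# A quick, rough pass with very few iterations (may warn that it did not converge).
rough <- kmeans(x, centers = 3, iter.max = 2)

# Refine the solution, starting from the rough centers instead of random ones.
refined <- kmeans(x, centers = rough$centers)

# Refinement can only tighten the clusters, never loosen them.
refined$tot.withinss <= rough$tot.withinss
```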

In the next blog we will learn how to apply logistic regression and market basket analysis to the bank data.

Got a question for us? Please mention it in the comments section and we will get back to you.



About Author
Abhay Kumar
I enjoy being challenged and working on projects that require me to work outside my comfort and knowledge set, as continuing to learn new languages and development techniques is important to me and to the success of my organization. My technical expertise includes cross-platform proficiency (Windows, Unix, Linux); fluency in scripting/programming languages (including Java/J2EE, R, SQL, jQuery, HTML); advanced knowledge of developer applications, tools, methodologies and best practices (including OOD, client/server architecture, self-test automation, web scraping, JSON processing, AJAX); and extensive knowledge of analytics platforms/tools (including Hadoop, Pig, MapReduce, HBase, R, OLAP cubes, data warehousing concepts, and reporting libraries like JFreeChart (Java) and Highcharts (JS)).
