
Implementing Hadoop & R Analytic Skills in Banking Domain

Last updated on Jul 01,2020 22.6K Views

In this blog and the next few that follow, we will analyze a banking-domain dataset containing several files with details of a bank's customers. This database was prepared by Petr Berka and Marta Sochorova.

The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.

By the time you finish reading this blog, you will have learned:

  • How to analyze a bank’s data to predict a customer’s quality. Using this analysis, we can place each customer into one of three categories:
  1. Excellent: Customers with a good record with the bank
  2. Good: Customers with average earnings and a good record so far
  3. Risky: Customers who are in debt to the bank or have not repaid their loans on time
  • How to write a Pig UDF
  • How to connect Hadoop with R
  • How to load data from Hadoop into R

How to analyze a bank’s data to predict the customer’s quality

Prerequisites

Software:

  • Java installed
  • Hadoop installed
  • Pig installed
  • R-base
  • RStudio
  • Ubuntu OS

Concepts:

  • Hadoop concepts
  • Java concepts
  • Pig concepts

1. Understanding the data

As described above, the Berka dataset is a collection of financial information from a Czech bank, prepared by Petr Berka and Marta Sochorova; we will refer to it as the Berka dataset throughout this analysis. It is distributed as the following eight files:

  • account.asc
  • card.asc
  • client.asc
  • disp.asc
  • district.asc
  • loan.asc
  • order.asc
  • trans.asc


2. Data cleaning

 Before we proceed to the analysis phase, we must understand the raw data. If the raw data needs formatting, we can pre-process it to clean it up.


 Here, I have cleaned the data and re-saved it as a CSV file. The pipeline below strips the double quotes, removes the header line, and converts the semicolon delimiter to commas:

 sed 's/"//g' account.asc | sed -e '1d' | sed 's/;/,/g' > account.csv


 Similarly, we can clean the rest of the data.
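For readers more comfortable outside the shell, the same cleaning step can be sketched in Python. Note that `clean_asc` is a hypothetical helper name, not part of the original workflow; it mirrors the sed pipeline above (strip quotes, drop the header line, swap semicolons for commas):

```python
def clean_asc(text):
    """Mimic the sed pipeline: strip double quotes, drop the
    header line, and convert semicolon delimiters to commas."""
    # Remove all double-quote characters
    lines = text.replace('"', '').splitlines()
    # Drop the first (header) line, replace ';' with ','
    return '\n'.join(line.replace(';', ',') for line in lines[1:])
```

Applied to a file, this produces the same CSV output as the sed command, one `.asc` file at a time.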

 3. Moving the data to HDFS

 hadoop dfs -put bank_project /user/abhay

 4. Writing PIG script

 Loan data:

 Loading and picking only required fields

loan = load '/user/itsupport/data_berka/loan.csv' using PigStorage(',') ; 
loan_fields = foreach loan generate $1 as ac_id,$0 as loan_id,$3 as amount,$6 as status;

Grouping on account Id

grp_loan_ac_id = group loan_fields by $0;

Breaking nested bag to store the fields as tuple

grp_loan_ac_id_flatten = foreach grp_loan_ac_id generate FLATTEN(loan_fields);

 Removing the header

filtered_grp = filter grp_loan_ac_id_flatten by $3 != 'status';

 Storing the short file back to HDFS

store filtered_grp into '/bank_project/loan_required_out' using PigStorage(',');

 Client data:

 Note: The date of birth (DOB) is encoded as yymmdd for male clients and yy(mm+50)dd for female clients; that is, 50 is added to the month for female customers. For example, 701230 is 30 Dec 1970 for a male customer, while the same date for a female customer would be 706230.

 To calculate the age from this format, a UDF is written and its jar is registered with the Pig execution engine.

REGISTER alljars/pig_age_calculator.jar

Loading and picking only required fields

client = load '/user/itsupport/data_berka/client.csv' using PigStorage(',') AS (Client_id:int,dob:chararray,dist_id:int) ; 
client_fields = foreach client generate $0 as client_id,$2 as district_id,$1 as birthday_n_sex;

 Grouping on client_id

grp_client_id = group client_fields by $0;

 Breaking nested bag to store the fields as tuple

grp_client_flat = FOREACH grp_client_id GENERATE FLATTEN(client_fields);

 Removing the header

B = filter grp_client_flat by $2 != 'birth_number';

Calling Java function to parse the date and finding age and sex of the customer

age = foreach B generate $0,$1, bank.Age_calculator(birthday_n_sex) ; 
 store age into '/bank_project/age_required_out' using PigStorage(',');
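The decoding performed by the `bank.Age_calculator` UDF can be sketched in Python. Here `parse_birth_number` is a hypothetical stand-in for the jar's logic, and the 1999 reference year is an assumption based on the dataset's time span (the transactions are filtered for years after 97):

```python
def parse_birth_number(birth_number):
    """Decode a Berka birth_number string (yymmdd; 50 is added to
    the month for female clients) into (age_in_1999, sex)."""
    yy = int(birth_number[0:2])
    mm = int(birth_number[2:4])
    sex = 'F' if mm > 50 else 'M'
    if mm > 50:
        mm -= 50                 # undo the +50 female encoding
    year = 1900 + yy             # all Berka clients were born in the 1900s
    age = 1999 - year            # assumed snapshot year of the dataset
    return age, sex
```

For instance, 701230 and 706230 decode to the same birth date (30 Dec 1970) with sex 'M' and 'F' respectively.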

 Transaction data:

Loading and picking only required fields

transaction = load '/user/itsupport/data_berka/transaction.csv' using PigStorage(',') as (trans_id:int,ac_id:int,date:chararray,type:chararray,operation:chararray,amount:int,bal:int,k_sym:chararray,bank:int,account:int); 
transaction_fields = foreach transaction generate $1 as ac_id,$2 as date_of_transaction,$3 as transaction_type,$5 as amount,$6 as bal_post_trnsaction;

Picking only the transactions from the last year

filtered_trans = filter transaction_fields by (int)SUBSTRING($1,0,2) > 97; 
grp_ac = group filtered_trans by $0;

Summing up the transactions carried out by each user over the last year

MAX_grp_ac = FOREACH grp_ac GENERATE group, SUM(filtered_trans.$3),SUM(filtered_trans.$4); 
store MAX_grp_ac into '/bank_project/transaction_left_bal_required_out' using PigStorage(',');
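The filter-and-sum above can be sketched in Python. `sum_recent_transactions` is a hypothetical helper name; it mirrors the Pig filter on the two-digit year and the two SUMs per account:

```python
from collections import defaultdict

def sum_recent_transactions(rows):
    """rows: (ac_id, date 'yymmdd', type, amount, balance) tuples.
    Keep only transactions whose two-digit year is > 97, then sum
    amount and post-transaction balance per account, as in the
    Pig filter + group + SUM."""
    totals = defaultdict(lambda: [0, 0])
    for ac_id, date, _type, amount, balance in rows:
        if int(date[0:2]) > 97:
            totals[ac_id][0] += amount
            totals[ac_id][1] += balance
    return {k: tuple(v) for k, v in totals.items()}
```

Each resulting entry corresponds to one line of the `/bank_project/transaction_left_bal_required_out` output.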

 Card details data:

 card = load '/user/itsupport/data_berka/card.csv' using PigStorage(',') ;
card_fields = foreach card generate $1 as disposition_id,$2 as card_type; 
 grp_card_disp_id = group card_fields by $0;
 flatten_card = foreach grp_card_disp_id generate FLATTEN(card_fields);
filtered_card = filter flatten_card by card_type != 'type';
store filtered_card into '/bank_project/card_required_out' using PigStorage(',');

 District data:

Register the jar that calculates the difference in unemployment between the two consecutive years 1995 and 1996

REGISTER alljars/pig_substractjar.jar 
 district = load '/user/itsupport/data_berka/district.csv' using PigStorage(',') AS (dist_id:int,dist_name:chararray,region:chararray,no_inhabs:long,mun_499:int,mun_1999:int,mun_10k:int,mun_more:int,no_of_cities:int,no_of_urban_inhabs:double,avg_sal:int,unemp_95:double,unemp_96:double,entre_ratio:int);
district_fields = foreach district generate $0 as district_id,$1 as district_name,$2 as region,$10 as avg_salary,$11 as unemp_rate_95,$12 as unemp_rate_96,$13 as entrepreneur_per_1000;
 grp_dist_id = group district_fields by $0;
 MAX_grp_dist = FOREACH grp_dist_id GENERATE group,FLATTEN(district_fields);
 B = filter MAX_grp_dist by unemp_rate_95 > 0.0 AND unemp_rate_96 > 0.0;
unem_percentage = foreach B generate $1, district_name,avg_salary,bank.substract(unemp_rate_95,unemp_rate_96),entrepreneur_per_1000 ; 
 store unem_percentage into '/bank_project/district_required_out' using PigStorage(',');

 Disposition data:

 disposition = load '/user/itsupport/data_berka/disposition.csv' using PigStorage(',') ;
disposition_fields = foreach disposition generate $2 as ac_id,$0 as disposition_id,$3 as disposition_type,$1 as client_id; 
 grp_disposition_disp_id = group disposition_fields by $1;
 flatten_disposition_disp_id = foreach grp_disposition_disp_id generate FLATTEN(disposition_fields);
 filtered_disposition_disp_id = filter flatten_disposition_disp_id by disposition_type != 'type';

 Joining all the data :-

 Before we start writing the script, we should design the approach/algorithm to be implemented:

  • Join loan, transaction, account, and disposition on ac_id as ac_id_join
  • Join ac_id_join, district_info, and client on district_id as include_district
  • Join include_district and card on disposition_id as join_done
  • Select loan_amount, loan_duration, loan_status, type, transaction_amount, date, owner_type, district_name, region, avg_salary, unemployment_rate_95, unemployment_rate_96, no_of_entrepreneur/1000, card_type, birthday
  • Algorithm used to classify excellent, good, and risky customers:

within 1 year {

if transaction_amount > 10 lakh and avg_sal > 10k and loan_status == 'A' and (age > 25 and age <= 65)

    write to a file called "excellent": more loans can be granted; the card can be upgraded

if transaction_amount > 10 lakh and avg_sal > 6k and (loan_status == 'A' or loan_status == 'C') and (age > 25 and age <= 55) and unemployment_rate < 0.80

    write to a file called "good": more loans can be granted after completion of the current loan; the card can be upgraded after completion of the loan

if avg_sal > 6k and (loan_status == 'B' or loan_status == 'D') and (age > 35) and no_of_entrepreneur > 100

    write to a file called "risky": no more loans; the card must be downgraded

}
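These rules can be sketched as a small Python function. `classify_customer` is a hypothetical name; the thresholds come directly from the rules above (10 lakh = 1,000,000), and the first matching rule wins, as in the Pig filters that follow:

```python
def classify_customer(transaction_sum, avg_sal, loan_status, age,
                      unemployment_rate, no_of_entrepreneur):
    """Apply the three classification rules in order.
    Returns 'excellent', 'good', 'risky', or None if no rule matches."""
    if (transaction_sum > 1_000_000 and avg_sal > 10_000
            and loan_status == 'A' and 25 < age <= 65):
        return 'excellent'
    if (transaction_sum > 1_000_000 and avg_sal > 6_000
            and loan_status in ('A', 'C') and 25 < age <= 55
            and unemployment_rate < 0.80):
        return 'good'
    if (avg_sal > 6_000 and loan_status in ('B', 'D')
            and age > 35 and no_of_entrepreneur > 100):
        return 'risky'
    return None
```

Note that a loan can only have one status at a time, so the A/C and B/D conditions are disjunctions, which is how the Pig script below implements them.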

Client_age = load '/bank_project/age_required_out' using PigStorage(',') AS (client_id:int,dist_id:int,age:double,sex:chararray);
card_type = load '/bank_project/card_required_out' using PigStorage(',') AS (disp_id:int,type:chararray);
transaction_sum = load '/bank_project/transaction_left_bal_required_out' using PigStorage(',') AS (ac_id:int,trans_sum:long,bal_sum:long);
loan_status = load '/bank_project/loan_required_out' using PigStorage(',') AS (ac_id:int,loan_id:int,amount:int,status:chararray);
district_info = load '/bank_project/district_required_out' using PigStorage(',') AS (district_id:int,dist_name:chararray,avg_sal:int,unemprate:double,entrepreneur:int);
join_disp_client = join filtered_disposition_disp_id by $3,Client_age by $0;
join_disp_client_card = join join_disp_client by $1,card_type by $0;
join_disp_client_card_district = join join_disp_client_card by $5,district_info by $0;
join_disp_client_card_district_trans_loan = join join_disp_client_card_district by $0,transaction_sum by $0,loan_status by $0;
pick_fields = foreach join_disp_client_card_district_trans_loan generate $0 as ac_id,$2 as disp_type,$6 as age,$7 as sex,$9 as card_type,$11 as dist_name,$12 as avg_sal,$13 as unemp_rate,$14 as no_of_entre,$16 as transaction_sum,$20 as loan_amount,$21 as loan_status;
store pick_fields into '/bank_project/combined_out' using PigStorage(',');
Good = filter pick_fields by $9 > 1000000 AND $6 > 10000 AND $11 == 'A' AND (($2 >= 25.0 AND $2 <=65.0));
store Good into '/bank_project/VIP_customer' using PigStorage(',');
Normal = filter pick_fields by $9 < 1000000 AND $9 >150000 AND $6 > 6000 AND ($11=='A' OR $11=='C') AND (($2 <= 55.0) AND ($2 >=25.0)) AND $7 < 0.80;
store Normal into '/bank_project/good_customer' using PigStorage(',');
Risky = filter pick_fields by $6 > 6000 AND ($11 == 'B' OR $11 == 'D') AND $2 > 35.0 AND $8 > 100;
store Risky into '/bank_project/risky_customer' using PigStorage(',');

Now, we have three categories of customers.

The bank can prepare separate plans for three categories as per their business goals.

In the next blog, we will analyze the intermediate Hadoop output, i.e. the file “/bank_project/combined_out”, and build clusters on the variables using R.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

Clustering on Bank Data Using R

Get Started with Big Data and Hadoop

Comments
31 Comments
  • lelouch_0 says:

    getting error could not find age calculator.jar

  • lelouch_0 says:

    getting error : bank.Age_calculator could not be found

    • EdurekaSupport says:

      +Lelouch_0, thanks for checking out our blog. The above error occurs if you haven’t created a User Defined Function for calculating the age from the format given in the data and its jar files need to be registered with the PIG execution engine. Hope this helps. Cheers!

  • M says:

    Hi,

    sed ‘s/”//g’ account.asc | sed -e ‘1d’ |sed ‘s/;/,/g’> acccount.csv is not working.. Can you please suggest anything?

    • EdurekaSupport says:

      Hey M, thanks for checking out our tutorial! Could you please share the details of the error you are getting so that we can support you better? Cheers!

  • Raza Raza says:

    Hy data is helpful for me but can i found a datset of Banking Domain??

  • Peter Joy says:

    Hi Abhay i m new to hadoop i m trying to do this process to get practice .can get some ideas how to create age calculation logic and code as well not as jar i want to see the code logic, also m new to java too.. please do me a favor.. thanks in advance

  • Meera Rajan says:

    hello friends
    Im trying few statistical operations to calculate customer potentiality usimg R tool. I need information for literature survey ,such as , how far “R” tool is of use in bank domain.Areas where it is aggressively used, its.Merits compared to other tools in bank domain.any exclusive papers available on it. Few papers , i browse are not pinpoint to the concept.

    • EdurekaSupport says:

      Hey Meera, thanks for checking out our blog. Business analytics is a fast growing field and there are many tools available in the market to serve the needs of organizations. The range of analytical software goes from relatively simple statistical tools in spreadsheets (ex-MS Excel) to statistical software packages (ex-KXEN, Statistica) to sophisticated business intelligence suites (ex-SAS, Oracle, SAP, IBM among the big players). Open source tools like R and Weka are also gaining popularity. Besides these, companies develop in-house tools designed for specific purposes.
      Merits of R:
      • R is the most comprehensive statistical analysis package available. It incorporates all of the standard statistical tests, models, and analyses, as well as providing a comprehensive language for managing and manipulating data. New technology and ideas often appear first in R.
      • The graphical capabilities of R are outstanding, providing a fully programmable graphics language that surpasses most other statistical and graphical packages.
      • R is free and open source software, allowing anyone to use and, importantly, to modify it. R is licensed under the GNU General Public License, with copyright held by The R Foundation for Statistical Computing.
      • R has no license restrictions (other than ensuring our freedom to use it at our own discretion), and so we can run it anywhere and at any time, and even sell it under the conditions of the license.
      • Anyone is welcome to provide bug fixes, code enhancements, and new packages, and the wealth of quality packages available for R is a testament to this approach to software development and sharing.
      • R has over 4800 packages available from multiple repositories specializing in topics like econometrics, data mining, spatial analysis, and bio-informatics.
      • R is cross-platform. It runs on many operating systems and different hardware. It is popularly used on GNU/Linux, Macintosh, and Microsoft Windows, running on both 32 and 64 bit processors.
      • R plays well with many other tools, importing data, for example, from CSV files, SAS, and SPSS, or directly from Microsoft Excel, Microsoft Access, Oracle, MySQL, and SQLite. It can also produce graphics output in PDF, JPG, PNG, and SVG formats, and table output for LaTeX and HTML.
      Demerits of R:
      • R has a steep learning curve; it does take a while to get used to the power of R, but no steeper than for other statistical languages. R is not so easy to use for the novice. There are several simple-to-use graphical user interfaces (GUIs) for R that encompass point-and-click interactions, but they generally do not have the polish of the commercial offerings.
      • Documentation is sometimes patchy and terse, and impenetrable to the non-statistician. However, some very high-standard books are increasingly plugging the documentation gaps.
      • The quality of some packages is less than perfect, although if a package is useful to many people, it will quickly evolve into a very robust product through collaborative efforts.
      • Many R commands give little thought to memory management, and so R can very quickly consume all available memory. This can be a restriction when doing data mining. There are various solutions, including using 64 bit operating systems that can access much more memory than 32 bit ones.
      Here’s a related blog: http://www.edureka.co/blog/clustering-on-bank-data-using-r/. Cheers!

  • jassy says:

    can i get jar files of age

    • EdurekaSupport says:

      Hey Jassy, thanks for checking out the blog. To calculate the age from the above date format, a UDF is written and its jar is registered with PIG execution engine.
      REGISTER alljars/pig_age_calculator.jar
      You can type this command in the terminal and proceed in accordance with the mentioned steps. Hope this helps.

      • Rajendra Prasad Balige says:

        grunt> REGISTER alljars/pig_age_calculator.jar
        2016-12-09 12:16:34,002 [main] ERROR org.apache.pig.tools.grunt.Grunt – ERROR 101: file ‘alljars/pig_age_calculator.jar’ does not exist.
        Details at logfile: /home/training/Desktop/pig_1481264251880.log

  • Steve Farr says:

    You could acheive a lot of the leg work here in RapidMiner and just incorporate your R scripts as an operator

  • suresh says:

    Hi Abhay.. what is R and Rbase, Rstudio…?

    • EdurekaSupport says:

      Hi Suresh, R is a programming language which helps you to perform statistical analysis and data visualization on a given dataset. r-base is the package that installs the base R system, and RStudio gives you a UI to work with R.

  • prashanth says:

    i’m confused with the date format for the female for writing the UDF.could u please explain a bit about it

    • EdurekaSupport says:

      Hi Prashanth, the date of birth (DOB) information is in the format yymm+50dd(for female)yymmdd(for male)

      Example : 701230 is 30th dec 1970 for a male customer.The same date for female customer would be : 706230 i.e. 50 is added in the month of a female customer.

      Hope this helps!

      • prashanth reddy says:

        thanks for your help sir.. :)

    • faheen fathima says:

      hi prashanth,
      can you please share the udf codes? It would be a great help!
