Published on Aug 07,2017
18.9K Views
Email Post

In this blog and the next few ones that will follow, we will analyze a banking domain dataset, which contains several files with details of its customers. This database was prepared by Petr Berka and Marta Sochorova.

The Berka dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.

By the time you finish reading this blog,  you would have learned :

  • How to analyze a bank’s data to predict a customer’s quality
  • Using this analysis we can categorize a customer into three categories:
  1. Excellent: Customers whose record is good with the bank
  2. Good: Customers who have average earning with a good record till now
  3. Risky: Customers who are under debt of bank or who has not paid the loan on time
  • How to write PIG UDF
  • How to connect Hadoop with R
  • How to load data from Hadoop to R

How to analyze a bank’s data to predict the customer’s quality

Prerequisite

Software Technology

  • Java installed Hadoop concepts
  • Hadoop installed Java concepts
  • Pig installed Pig concepts
  • R-base
  • Rstudio
  • Ubuntu OS

1. Understanding the data

The berka dataset was prepared by Petr Berka and Marta Sochorova. We will refer to the dataset as the Berka Dataset throughout this analysis. The Berka Dataset is a collection of financial information from a Czech bank. The dataset deals with over 5,300 bank clients with approximately 1,000,000 transactions. Additionally, the bank represented in the dataset has extended close to 700 loans and issued nearly 900 credit cards, all of which are represented in the data.

  • account.asc
  • card.asc
  • client.asc
  • disp.asc
  • district.asc
  • loan.asc
  • order.asc
  • trans.asc

Implementing hadoop and R analytic skills in banking domain

2. Data cleaning

 Before we proceed to analysis phase we must understand the raw data. If the raw data needs formatting, we can do pre-processing to clean it.

Implementing hadoop and R analytic skills in banking domain

 Here, I have cleaned the data and re-saved as a CSV file.

 sed 's/"//g' account.asc | sed -e '1d' |sed 's/;/,/g'> acccount.csv

Implementing hadoop and R analytic skills in banking domain

 Similarly, we can clean the rest of the data.

 3. Moving the data to HDFS

 hadoop dfs -put bank_project /user/abhay

 4. Writing PIG script

 Loan data: –

 Loading and picking only required fields

loan = load '/user/itsupport/data_berka/loan.csv' using PigStorage(',') ; 
loan_fields = foreach loan generate $1 as ac_id,$0 as loan_id,$3 as amount,$6 as status;

Grouping on account Id

grp_loan_ac_id = group loan_fields by $0;

Breaking nested bag to store the fields as tuple

grp_loan_ac_id_flatten = foreach grp_loan_ac_id generate FLATTEN(loan_fields);

 Removing the header

filtered_grp = filter grp_loan_ac_id_flatten by $3 != 'status';

 Storing the short file back to HDFS

store filtered_grp into '/bank_project/loan_required_out' using PigStorage(','); Client data: -

 Note : The date of birth (DOB) information is in the format yymm+50dd(for female)yymmdd(for male)

 For calculating the age from the above date format a UDF is written and its jar is registered with PIG execution engine.

REGISTER alljars/pig_age_calculator.jar

Loading and picking only required fields

client = load '/user/itsupport/data_berka/client.csv' using PigStorage(',') AS (Client_id:int,dob:chararray,dist_id:int) ; 
client_fields = foreach client generate $0 as client_id,$2 as district_id,$1 as birthday_n_sex;

 Grouping on client_id

grp_client_id = group client_fields by $0;

 Breaking nested bag to store the fields as tuple

grp_client_flat = FOREACH grp_client_id GENERATE FLATTEN(client_fields);

 Removing the header

B = filter grp_client_flat by $2 != 'birth_number;

Calling Java function to parse the date and finding age and sex of the customer

age = foreach B generate $0,$1, bank.Age_calculator(birthday_n_sex) ; 
 store age into '/bank_project/age_required_out' using PigStorage(',');

 Transaction data:

Loading and picking only required fields

transaction = load '/user/itsupport/data_berka/transaction.csv' using PigStorage(',') as (trans_id:int,ac_id:int,date:chararray,type:chararray,operation:chararray,amount:int,bal:int,k_sym:chararray,bank:int,account:int); 
transaction_fields = foreach transaction generate $1 as ac_id,$2 as date_of_transaction,$3 as transaction_type,$5 as amount,$6 as bal_post_trnsaction;

Picking only transaction of last one year

filtered_trans = filter transaction_fields by (int)SUBSTRING($1,0,2) > 97; 
grp_ac = group filtered_trans by $0;

Summing up the transaction carried out by the user in last one year

MAX_grp_ac = FOREACH grp_ac GENERATE group, SUM(filtered_trans.$3),SUM(filtered_trans.$4); 
store MAX_grp_ac into '/bank_project/transaction_left_bal_required_out' using PigStorage(',');

 Card details data:-

 card = load '/user/itsupport/data_berka/card.csv' using PigStorage(',') ;
card_fields = foreach card generate $1 as disposition_id,$2 as card_type; 
 grp_card_disp_id = group card_fields by $0;
 flatten_card = foreach grp_card_disp_id generate FLATTEN(card_fields);
filtered_card = filter flatten_card by card_type != 'type';
store filtered_card into '/bank_project/card_required_out' using PigStorage(',');

 District data:

Register the jar to calculate the difference in unemployment between two consecutive years 95 and 96

REGISTER alljars/pig_substractjar.jar 
 district = load '/user/itsupport/data_berka/district.csv' using PigStorage(',') AS (dist_id:int,dist_name:chararray,region:chararray,no_inhabs:long,mun_499:int,mun_1999:int,mun_10k:int,mun_more:int,no_of_cities:int,no_of_urban_inhabs:double
,avg_sal:int,unemp_95:double,unemp_96:double,entre_ratio:int); 
district_fields = foreach district generate $0 as district_id,$1 as district_name,$2 as region,$10 as avg_salary,$11 as unemp_rate_95,$12 as unemp_rate_96,$13 as entrepreneur_per_1000;
 grp_dist_id = group district_fields by $0;
 MAX_grp_dist = FOREACH grp_dist_id GENERATE group,FLATTEN(district_fields);
 B = filter MAX_grp_dist by unemp_rate_95 > 0.0 AND unemp_rate_96 > 0.0;
unem_percentage = foreach B generate $1, district_name,avg_salary,bank.substract(unemp_rate_95,unemp_rate_96),entrepreneur_per_1000 ; 
 store unem_percentage into '/bank_project/district_required_out' using PigStorage(',');

 Disposition data:

 disposition = load '/user/itsupport/data_berka/disposition.csv' using PigStorage(',') ;
disposition_fields = foreach disposition generate $2 as ac_id,$0 as disposition_id,$3 as disposition_type,$1 as client_id; 
 grp_disposition_disp_id = group disposition_fields by $1;
 flatten_disposition_disp_id = foreach grp_disposition_disp_id generate FLATTEN(disposition_fields);
 filtered_disposition_disp_id = filter flatten_disposition_disp_id by disposition_type != 'type';

 Joining all the data :-

 Before we start writing script we should design our approach/Algorithm which should be implemented:

  • join loan,transaction,Account,Disposition,on ac_id as ac_id_join
  • join ac_id_join,district_info,client on district_id as include_district
  •  join include_district,card on disposition_id as join_done
  • select loan_amount,loan_duration,loan_status,type,transaction_amount,date,owner_type,district_name,region,avg_salary,unemployment_rate_95,unemployment_rate_96,no_of_enterpreneur/1000,card type,birthday
  • Algorithm used to predict excellent, good and risky customers:

within 1 year {

if transaction_amount > 10 lac and avg_sal > 10k and loan_status==’A’ and (age >25 and age <=65)

write in a file called good more loan can be granted card can be upgrade

if transaction_amount > 10 lac and avg_sal > 6k and loan_status==’A’ and loan_status==’C’ and (age >25 and age <=55) and unemployment_rate < 0.80

write in a file called ok more loan can be granted after completion of the loan card can be upgraded after completion of the loan

if avg_sal > 6k and loan_status==’B’ and loan_status==’D’ and (age >35) and no_of_entrepreneur>100

write in a file called risky no more loans card must be downgraded

}

Client_age = load '/bank_project/age_required_out' using PigStorage(',') AS (client_id:int,dist_id:int,age:double,sex:chararray);
card_type = load '/bank_project/card_required_out' using PigStorage(',') AS (disp_id:int,type:chararray);
transaction_sum = load '/bank_project/transaction_left_bal_required_out' using PigStorage(',') AS (ac_id:int,trans_sum:long,bal_sum:long);
loan_status = load '/bank_project/loan_required_out' using PigStorage(',') AS (ac_id:int,loan_id:int,amount:int,status:chararray);
district_info = load '/bank_project/district_required_out' using PigStorage(',') AS (district_id:int,dist_name:chararray,avg_sal:int,unemprate:double,entrepreneur:int);
join_disp_client = join filtered_disposition_disp_id by $3,Client_age by $0;
join_disp_client_card = join join_disp_client by $1,card_type by $0;
join_disp_client_card_district = join join_disp_client_card by $5,district_info by $0;
join_disp_client_card_district_trans_loan = join join_disp_client_card_district by $0,transaction_sum by $0,loan_status by $0;
pick_fields = foreach join_disp_client_card_district_trans_loan generate $0 as ac_id,$2 as disp_type,$6 as age,$7 as sex,$9 as card_type,$11 as dist_name,$12 as avg_sal,$13 as unemp_rate,$14 as no_of_entre,$16 as transaction_sum,$20 as loan_amount,$21 as loan_status;
store pick_fields into '/bank_project/combined_out' using PigStorage(',');
Good = filter pick_fields by $9 > 1000000 AND $6 > 10000 AND $11 == 'A' AND (($2 >= 25.0 AND $2 <=65.0));
store Good into '/bank_project/VIP_customer' using PigStorage(',');
Normal = filter pick_fields by $9 < 1000000 AND $9 >150000 AND $6 > 6000 AND ($11=='A' OR $11=='C') AND (($2 <= 55.0) AND ($2 >=25.0)) AND $7 < 0.80;
store Normal into '/bank_project/good_customer' using PigStorage(',');
Risky = filter pick_fields by $6 > 6000 AND ($11 == 'B' OR $11 == 'D')
AND $2 > 35.0 AND $8 > 100; 
store Risky into '/bank_project/risky_customer' using PigStorage(',');

Now, we have three categories of customers.

The bank can prepare separate plans for three categories as per their business goals.

In the next blog we will analyze the intermediate output of Hadoop i.e. output file “/bank_project/combined_out” and build clusters on the variable using R.

Got a question for us? Please mention it in the comments section and we will get back to you.

Related Posts:

clustering-on-bank-data-using-R

Get Started with Big Data and Hadoop

About Author
Abhay Kumar
Published on Aug 07,2017
I enjoy being challenged and working on projects that require me to work outside my comfort and knowledge set, as continuing to learn new languages and development techniques are important to me and the success of my organization. My technical expertise includes cross-platform proficiency (Windows, Unix, Linux );fluency in scripting/programming languages (including java/j2ee,R,sql,jquery,html etc); and advanced knowledge of developer applications, tools, methodologies and best practices (including OOD, client/server architecture and self-test automation,web scrapping,json processing,ajax);have extensive knowledge of analytics platform/tools(including hadoop,pig,mapreduce,hbase,R,OLAP CUBES,Data warehousing concepts,reporting libraries like jfreechart(java),highchart(js) etc).

Share on

Browse Categories

Comments
29 Comments