Start an iteration on first row of a group Pandas

0 votes

I have a dataset like this:

Policy | Customer | Employee | CoveragDate | LapseDate
123    | 1234     | 1234     | 2011-06-01  | 2015-12-31
124    | 1234     | 1234     | 2016-01-01  | ?
125    | 1234     | 1234     | 2011-06-01  | 2012-01-01
124    | 5678     | 5555     | 2014-01-01  | ?

I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.

So, expected output would be:

Policy | Customer | Employee
123    | 1234     | 1234

because policy 123's lapse date was within 5 days of policy 124's covered date.

I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.

So far, I've been able to write this code:

import pandas
import datetime

wd = pandas.read_csv(DATASOURCE)
l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().iteritems():
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1

This code is not working because I'm trying to reference the coverage date/lapse dates on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on, but it appears that Pandas starts iterating through groups randomly. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.

The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.

I've read through a question regarding finding the first row of a group, and am able to do so by using

wd.groupby(['EMPID','EBCN']).first()

but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?

Regarding my general method, I've read through the question here, which is very close to what I need:

pandas computation in each group

however, I need to compare each policy in the group against each other policy in the group - the question above just compares the last row in each group against the others.

Is there a way to do what I'm attempting in Pandas/Python?

Sep 6, 2018 in Python by bug_seeker
• 15,360 points
334 views

1 answer to this question.

0 votes

For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:

  1. Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
  2. Replacing the '?' (null values) in my data with a numpy.nan datatype
  3. Sorting the left and right dataframes by the Date columns

Once the data was in the correct format, I implemented the merge:

pandas.merge_asof(cov, term,
    on='Date',
    by='EMP|EBCN',
    tolerance=pandas.Timedelta('5 days'))

Note 'cov' is my dataframe containing coverage dates, term is the dataframe with lapses. The 'EMP|EBCN' column is a concatenated column of the employee ID and Customer # fields, to allow easy use of the 'by' field.

answered Sep 6, 2018 by Priyaj
• 57,300 points

Related Questions In Python

+1 vote
4 answers

Count the frequency of an item in a python list

To count the number of appearances: from collections ...READ MORE

answered Oct 18, 2018 in Python by tinitales
2,081 views
0 votes
1 answer

How can I find out the index of an element from row and column in Python?

You probably want to use np.ravel_multi_index: [code] import numpy ...READ MORE

answered Apr 16, 2018 in Python by charlie_brown
• 7,720 points
109 views
+1 vote
1 answer

How to print first character of each word in upper case of a string in Python

You can use split() using space as ...READ MORE

answered Jun 1, 2018 in Python by george
• 200 points
2,675 views
0 votes
1 answer

Question on PyQt: How to connect a signal to a slot to start a background operation in Python

It shouldn't matter whether the connection is ...READ MORE

answered Nov 27, 2018 in Python by Nymeria
• 3,520 points
185 views
+1 vote
2 answers

how can i count the items in a list?

Syntax :            list. count(value) Code: colors = ['red', 'green', ...READ MORE

answered Jul 6 in Python by Neha
• 330 points

edited Jul 8 by Kalgi 409 views
0 votes
1 answer

How to sort a list of strings?

Try  items = ["live", "like", "code", "cool", "bug"] ...READ MORE

answered Jul 27, 2018 in Python by Priyaj
• 57,300 points
33 views