Start an iteration on the first row of a group in Pandas

Question

I have a dataset like this:

Policy | Customer | Employee | CoverageDate | LapseDate
123    | 1234     | 1234     | 2011-06-01   | 2015-12-31
124    | 1234     | 1234     | 2016-01-01   | ?
125    | 1234     | 1234     | 2011-06-01   | 2012-01-01
124    | 5678     | 5555     | 2014-01-01   | ?

I'm trying to iterate through each policy for each employee of each customer (a customer can have many employees, an employee can have multiple policies) and compare the covered date against the lapse date for a particular employee. If the covered date and lapse date are within 5 days, I'd like to add that policy to a results list.

So, expected output would be:

Policy | Customer | Employee
123    | 1234     | 1234

because policy 123's lapse date was within 5 days of policy 124's covered date.

I'm running into a problem while trying to iterate through each grouping of Customer/Employee numbers. I'm able to identify how many rows of data are in each EmployeeID/Customer number (EBCN below) group, but I need to reference specific data within those rows to assign variables for comparison.

So far, I've been able to write this code:

import pandas
import datetime

wd = pandas.read_csv(DATASOURCE)
l = 0
for row, i in wd.groupby(['EMPID', 'EBCN']).size().items():  # iteritems() was removed in pandas 2.0
    Covdt = pandas.to_datetime(wd.loc[l, 'CoverageEffDate'])
    for each in range(i):
        LapseDt = wd.loc[l, 'LapseDate']
        if LapseDt != '?':
            LapseDt = pandas.to_datetime(LapseDt) + datetime.timedelta(days=5)
            if Covdt < LapseDt:
                print('got one!')
        l = l + 1

This code is not working because I'm trying to reference the coverage date/lapse dates on a particular row with the loc function, with my row number stored in the 'l' variable. I initially thought that Pandas would iterate through groups in the order they appear in my dataset, so that I could simply start with l=0 (i.e. the first row in the data), assign the coverage date and lapse date variables based on that, and then move on, but it appears that Pandas starts iterating through groups randomly. As a result, I do indeed get a comparison of lapse/coverage dates, but they're not associated with the groups that end up getting output by the code.

The best solution I can figure is to determine what the row number is for the first row of each group and then iterate forward by the number of rows in that group.

I've read through a question regarding finding the first row of a group, and am able to do so by using

wd.groupby(['EMPID','EBCN']).first()

but I haven't been able to figure out what row number the results are stored on in a way that I can reference with the loc function. Is there a way to store the row number for the first row of a group in a variable or something so I can iterate my coverage date and lapse date comparison forward from there?

However, I need to compare each policy in the group against every other policy in the group; the question above only compares the last row in each group against the others.

Is there a way to do what I'm attempting in Pandas/Python?
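For illustration, here is a minimal sketch of one way to reach the rows of each group without tracking a row counter by hand. Iterating over a groupby object yields each group key together with that group's sub-frame, and the sub-frame's index holds the original row labels, so they can be passed to loc. By default groupby sorts the group keys (sort=True), which is why the groups do not come back in file order. The column names follow the sample table rather than the EMPID/EBCN names in the code above, and the rule "coverage starts 0 to 5 days after the lapse" is an assumed reading of "within 5 days".

```python
import io
import itertools

import pandas as pd

# Sample rows from the table above; '?' marks a policy that has not lapsed.
raw = io.StringIO(
    "Policy,Customer,Employee,CoverageDate,LapseDate\n"
    "123,1234,1234,2011-06-01,2015-12-31\n"
    "124,1234,1234,2016-01-01,?\n"
    "125,1234,1234,2011-06-01,2012-01-01\n"
    "124,5678,5555,2014-01-01,?\n"
)
wd = pd.read_csv(raw, na_values="?")
wd["CoverageDate"] = pd.to_datetime(wd["CoverageDate"])
wd["LapseDate"] = pd.to_datetime(wd["LapseDate"])  # missing values become NaT

window = pd.Timedelta(days=5)  # assumed meaning of "within 5 days"
first_rows = {}
matches = []

for (customer, employee), group in wd.groupby(["Customer", "Employee"]):
    # group.index holds the original row labels, usable with wd.loc
    first_rows[(customer, employee)] = int(group.index[0])

    # Compare every policy in the group against every other policy.
    for a, b in itertools.permutations(group.index, 2):
        lapse = wd.loc[a, "LapseDate"]
        start = wd.loc[b, "CoverageDate"]
        if pd.notna(lapse) and pd.Timedelta(0) <= start - lapse <= window:
            matches.append((int(wd.loc[a, "Policy"]), int(customer), int(employee)))

print(first_rows)
print(matches)
```

The pairwise loop is quadratic in the group size, so it is only reasonable when each employee has a handful of policies.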

Priyaj · Answer 1 · Sep 6, 2018

For anyone needing this information in the future - I was able to implement Boud's suggestion to use the pandas.merge_asof() function to replace my code above. I had to do some data manipulation to get the desired result:

Splitting the dataframe into two separate frames - one with CoverageDate and one with LapseDate.
Replacing the '?' placeholders (null values) in my data with numpy.nan
Sorting the left and right dataframes by the Date columns
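The three preparation steps above can be sketched roughly as follows. This is an illustration on the sample rows from the question, not the exact code I ran: the column names come from the sample table, the NewPolicy and LapsedPolicy names are made up for the example, and dropping the rows without a lapse date from the right-hand frame is an assumption (merge_asof does not accept null values in the 'on' column).

```python
import io

import numpy as np
import pandas as pd

raw = io.StringIO(
    "Policy,Customer,Employee,CoverageDate,LapseDate\n"
    "123,1234,1234,2011-06-01,2015-12-31\n"
    "124,1234,1234,2016-01-01,?\n"
    "125,1234,1234,2011-06-01,2012-01-01\n"
    "124,5678,5555,2014-01-01,?\n"
)
wd = pd.read_csv(raw)  # LapseDate is read as text because of the '?' entries

# Replace the '?' placeholders with NaN, then parse both date columns.
wd["LapseDate"] = pd.to_datetime(wd["LapseDate"].replace("?", np.nan))
wd["CoverageDate"] = pd.to_datetime(wd["CoverageDate"])

# Concatenated key of employee ID and customer number.
wd["EMP|EBCN"] = wd["Employee"].astype(str) + "|" + wd["Customer"].astype(str)

# Split into one frame per date type, sharing a common 'Date' column name.
cov = wd[["Policy", "EMP|EBCN", "CoverageDate"]].rename(
    columns={"Policy": "NewPolicy", "CoverageDate": "Date"}
)
term = wd.dropna(subset=["LapseDate"])[["Policy", "EMP|EBCN", "LapseDate"]].rename(
    columns={"Policy": "LapsedPolicy", "LapseDate": "Date"}
)

# merge_asof requires both frames to be sorted by the 'on' column.
cov = cov.sort_values("Date").reset_index(drop=True)
term = term.sort_values("Date").reset_index(drop=True)
```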

Once the data was in the correct format, I implemented the merge:

pandas.merge_asof(cov, term,
    on='Date',
    by='EMP|EBCN',
    tolerance=pandas.Timedelta('5 days'))

Note: 'cov' is my dataframe containing coverage dates, and 'term' is the dataframe with lapse dates. The 'EMP|EBCN' column concatenates the employee ID and customer number fields so that it can be passed to the 'by' parameter.
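As a small worked example of the merge, the sketch below builds the two frames by hand from the sample rows in the question (already cleaned and sorted) and keeps only the coverage rows that found a lapse. The NewPolicy, LapsedPolicy and LapseDate column names are hypothetical and only there to make the result readable. It relies on the default direction='backward', which matches each coverage date to the most recent lapse date on or before it within the tolerance.

```python
import pandas as pd

# Coverage dates, sorted by Date (left frame).
cov = pd.DataFrame({
    "NewPolicy": [123, 125, 124, 124],
    "EMP|EBCN": ["1234|1234", "1234|1234", "5555|5678", "1234|1234"],
    "Date": pd.to_datetime(["2011-06-01", "2011-06-01", "2014-01-01", "2016-01-01"]),
})

# Lapse dates, sorted by Date, with never-lapsed policies left out (right frame).
term = pd.DataFrame({
    "LapsedPolicy": [125, 123],
    "EMP|EBCN": ["1234|1234", "1234|1234"],
    "Date": pd.to_datetime(["2012-01-01", "2015-12-31"]),
})
# Keep a copy of the lapse date: the merged frame has a single 'Date' column.
term["LapseDate"] = term["Date"]

merged = pd.merge_asof(
    cov, term,
    on="Date",
    by="EMP|EBCN",
    tolerance=pd.Timedelta("5 days"),
)

# Coverage rows with no lapse inside the window get NaN in the right-hand columns.
hits = merged.dropna(subset=["LapsedPolicy"])
print(hits)
```

Here the only surviving row pairs new policy 124 with lapsed policy 123 for key '1234|1234'. Keep in mind that merge_asof returns at most one match per left-hand row, so if several lapses fall inside the window only the nearest one is reported.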