Comparing worksheets from a directory of xlsx files

0 votes

I'm trying to compare worksheets in a lot of different excel files, however it's a tough problem. The issue appears to be as follows:

I have a directory with roughly 1000.xlsx files, each with 16 worksheets and dated sequentially. Each file contains a forecast of the numbers for a specific day using data from four subsequent days plus, if necessary, the previous week's Saturday.

For instance, the filename for the first five worksheets in it would be ['10-12','13-12','14-12','15-12','16-12'] if I were to predict the 17th of December 2022. The 11th and 12th are omitted because the 10th falls on a Saturday.

Lets call this file 1.

This file will be called xxx 1218.xlsx and contain the following 5 sheets: "17-12," "13-12," "14-12," "15-12," and "16-12." The following file in the folder will be the following day, the 18th of December 2022, and contains 4 of the same days plus a new one, which is the actual result (not the day we were predicting) that we were trying to predict. The 17th is a Saturday, therefore the order is irrelevant.

Lets call this file 2.

The challenge now is. I want to make sure that all of the worksheets in file 1, which have the same dates as file 2, have the same information. Make the same comparison after that by iteratively going through each subsequent file. Hence, after comparing the worksheets from files 1 and 2, file 2's worksheets should be compared to files 3 and so on.

I currently have the following, which performs a number of complex acrobatics to obtain the names of the worksheets from the directory, a list of the pertinent ones from the dictionary keys, and then calls them in dataframes to compare them.

import pandas as pd
import os
from datetime import date

path_root =r'C:\Users\Files'
filenames = [file for file in os.listdir(path_root) if file.endswith('.xlsx')]

keys_list = []
df1_keys_list = []
df2_keys_list = []

for i in range(len(filenames)):
    filename = filenames[i]
    filename_compare = filenames[i+1]
    day = filename[-7:-5]
    month = filename[-9:-7]
    day_compare = filename_compare[-7:-5]
    month_compare = filename_compare[-9:-7]
    filename_compare = 'prealign_debug_'+month_compare+day_compare+'.xlsx'
    Path_1 = os.path.join(path_root,filename)
    Path_2 = os.path.join(path_root,filename_compare)
    df1 = pd.read_excel(Path_1, sheet_name=None)
    df2 = pd.read_excel(Path_2, sheet_name=None)
    for j in df1.keys():
    df1_days = df1_keys_list[0:5]
    for k in df2.keys():
    df2_days = df2_keys_list[0:5]
    df3 = pd.read_excel(Path_1, sheet_name=df1_days[1])
    df4 = pd.read_excel(Path_1, sheet_name=df1_days[2])
    df5 = pd.read_excel(Path_1, sheet_name=df1_days[3])
    df6 = pd.read_excel(Path_2, sheet_name=df1_days[0])
    df7 = pd.read_excel(Path_2, sheet_name=df1_days[1])
    df8 = pd.read_excel(Path_2, sheet_name=df1_days[2])
    [print('correct') if df3.equals(df6) else print('incorrect')]
    [print('correct') if df4.equals(df7) else print('incorrect')]
    [print('correct') if df5.equals(df8) else print('incorrect')]

This is a bit cumbersome, probably there's a slicker way of doing it but this is what I have anyway.

Two problems arise, one: Something is wrong with my loop structure here

for j in df1.keys():
    df1_days = df1_keys_list[0:5]

as 95 instead of 5 elements should have been in the list df1 keys list. There might be a more effective approach to create the keys list that references the worksheets.

Second, I can tell it's not functioning since I'm comparing 5 of the exact same files, and when I make a change to one of them, the script still prints a "right" statement, EXCEPT when I make a change to the second file.

How can I tidy this up and what am I doing wrong?

Apr 6, 2023 in Others by Kithuzzz
• 38,010 points

1 answer to this question.

0 votes

Here is a refactored version of your code. Among other things, I inserted an if statement inside the loop for typical days in place of list comprehensions, I changed the keys to lists and took the first 5 elements using list slicing for simplicity.

import pandas as pd
import os

path_root = r'C:\Users\Files'
filenames = [file for file in os.listdir(path_root) if file.endswith('.xlsx')]

def compare_files(file1, file2):
    df1 = pd.read_excel(file1, sheet_name=None)
    df2 = pd.read_excel(file2, sheet_name=None)

    df1_days = list(df1.keys())[:5]
    df2_days = list(df2.keys())[:5]

    common_days = set(df1_days) & set(df2_days)

    for day in common_days:
        if df1[day].equals(df2[day]):
            print(f'{day} in {file1} and {file2} are correct')
            print(f'{day} in {file1} and {file2} are incorrect')

for i in range(len(filenames) - 1):
    filename1 = os.path.join(path_root, filenames[i])
    filename2 = os.path.join(path_root, filenames[i + 1])
    compare_files(filename1, filename2)
answered Apr 6, 2023 by narikkadan
• 63,420 points

Related Questions In Others

0 votes
1 answer
0 votes
1 answer

How to open .xlsx files in MS Excel from VS Code?

Hello, to open xlxs files, or files ...READ MORE

answered Feb 17, 2022 in Others by gaurav
• 23,260 points
0 votes
1 answer

Modifying CSV files from a local folder-VBA

Solution  You need to fully qualify your Columns object ...READ MORE

answered Jan 17, 2023 in Others by narikkadan
• 63,420 points
0 votes
1 answer
0 votes
2 answers
+1 vote
2 answers

how can i count the items in a list?

Syntax :            list. count(value) Code: colors = ['red', 'green', ...READ MORE

answered Jul 7, 2019 in Python by Neha
• 330 points

edited Jul 8, 2019 by Kalgi 4,056 views
0 votes
1 answer
0 votes
1 answer

Is there any way in python to auto-correct spelling mistake in multiple rows of an excel files of a single column?

Use Spellchecker for doing your stuff: import pandas ...READ MORE

answered Oct 14, 2022 in Others by narikkadan
• 63,420 points
0 votes
1 answer
webinar_success Thank you for registering Join Edureka Meetup community for 100+ Free Webinars each month JOIN MEETUP GROUP