I have a CSV file that when I open in notepad, displays as follows:
A,B C, D,E,F,G,H
I see that it shows up as Unix (LF) and UTF-8 at the bottom right of the status bar. When I open the file in excel, save it (but without making any changes), and close it, it will convert it to Windows (CRLF) as expected and displays as follows in notepad:
A,B,,, C,,,, D,E,F,G,H
The header row is the third row (D,E,F,G,H) and my understanding is that prior to saving, Excel reads the entire CSV file, figures out that the longest row has 4 commas and uses that format throughout the entire file. The problem I’m running into is reading the original LF CSV file into Pandas .read_csv. I think I’ve narrowed down the solution to 2 possible options (but please correct me if i’m wrong):
Option 1: In my main python script, I start with a def function that just iterates through every csv file in a folder where I open, save, and close in order to format it into CRLF prior to working with the csv files in Pandas.
Option 2: Format the csv file upon reading it into Pandas. I feel like this is the better option, especially knowing the number of columns I have and using .read_csv(header = 3) but when I open the output file and run excel formulas, calculation times are insane, even for relatively small files. I have a feeling it’s a datatype issue but I’m still new to all of this. Any clarification or resources are greatly appreciated!
Source: Python Questions