Is there a way to read and alter the contents of a huge csv file in PyCharm?

  csv, large-files, pandas, pycharm, python

I’m writing a program that reads a CSV, checks whether any of several substrings appears in one column of each row, and, if none is present, writes selected columns of that row to a new CSV. I have the code for this much, but the CSV I need to run it on has well over 3 million rows. I use PyCharm, and currently I can’t process this much data: PyCharm only opens the CSV in read-only mode, which doesn’t let me work with it. I know pandas has a chunksize option for read_csv, but I don’t know how to implement it with the rest of my code.

import csv

import pandas as pd


def reading(csv_input):
    originalLength = 0
    rowCount = 0
    with open(f'Web Report {csv_input}', 'w', newline='') as file:
        writer = csv.writer(file)
        writer.writerow(['Index', 'URL Category', 'User IP', 'URL'])
        dropCount = 0
        data = pd.read_csv(csv_input, chunksize=100000)
        df = pd.DataFrame(data,
                          columns=['Line', 'Date', 'Hour', 'User Name', 'User IP', 'Site Name',
                                   'URL Category', 'Action', 'Action Description'])
        originalLength = len(df.index)
        for line in range(originalLength):
            dataLine = df.loc[line]
            x = dataLine.get(key='Action')
            if x == 0:
                siteName = dataLine.get(key='Site Name')
                if 'dbk' in siteName:
                    dropCount = dropCount + 1
                elif 'ptc' in siteName:
                    dropCount = dropCount + 1
                elif 'wcf' in siteName:
                    dropCount = dropCount + 1
                elif 'google' in siteName:
                    dropCount = dropCount + 1
                else:
                    writer.writerow([line,  # Original Index
                                     df.loc[line].get(key='URL Category'),  # Original URL Category
                                     df.loc[line].get(key='User IP'),  # Original User IP
                                     df.loc[line].get(key='Site Name')])  # Original Site Name
                    rowCount = rowCount + 1
            else:
                dropCount = dropCount + 1
    print("Input: " + str(csv_input))
    print("Output: " + str(file.name))
    print("Original Length: " + str(originalLength))
    print("Current Length: " + str(rowCount))
    print("Drop Count: " + str(dropCount) + "\n")

    return df
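
One possible way to fit chunksize into the logic above, as a rough sketch rather than a definitive fix: when chunksize is passed, pd.read_csv returns an iterator of DataFrames instead of a single DataFrame, so it has to be looped over rather than handed to pd.DataFrame. The function name filter_in_chunks, the csv_output parameter, and the BLOCKED and NEEDED lists are made up for this illustration. It also assumes the input has the 'Action', 'Site Name', 'URL Category' and 'User IP' columns from the question, and that 'Action' is parsed as a number.

```python
import pandas as pd

# Substrings that cause a row to be dropped (taken from the question's code).
BLOCKED = ['dbk', 'ptc', 'wcf', 'google']
# Only the columns the filter and the output need (assumed to exist in the input).
NEEDED = ['Action', 'Site Name', 'URL Category', 'User IP']


def filter_in_chunks(csv_input, csv_output, chunksize=100_000):
    """Stream csv_input chunk by chunk and write the kept rows to csv_output."""
    original_length = 0
    row_count = 0
    # The blocked strings are plain letters, so they can be joined into a regex as-is.
    pattern = '|'.join(BLOCKED)
    first = True  # write the header row only once

    with open(csv_output, 'w', newline='') as out:
        # With chunksize set, read_csv yields one DataFrame per chunk.
        for chunk in pd.read_csv(csv_input, usecols=NEEDED, chunksize=chunksize):
            original_length += len(chunk)
            blocked = chunk['Site Name'].astype(str).str.contains(pattern, regex=True)
            kept = chunk[(chunk['Action'] == 0) & ~blocked]
            # The chunk index continues across chunks, so it is the original row index.
            (kept[['URL Category', 'User IP', 'Site Name']]
                .rename(columns={'Site Name': 'URL'})
                .to_csv(out, header=first, index_label='Index'))
            first = False
            row_count += len(kept)

    drop_count = original_length - row_count
    return original_length, row_count, drop_count
```

Calling something like filter_in_chunks('report.csv', 'Web Report report.csv') (file names are placeholders) keeps only one chunk in memory at a time. The script reads the file from disk, so the CSV should not need to be opened in the PyCharm editor at all.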

Source: Python Questions
