What is the fastest way to load data from multiple CSV files?

I am working with multiple CSV files, each containing several 1D data series stored as columns. I have about 9,000 such files, and the combined data is about 40 GB.

I have written a dataloader like this:

import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.files = files
        # Parse the file once up front, just to learn how many
        # columns (samples) it contains.
        my_data = np.genfromtxt('/data/' + files, delimiter=',')
        self.dim = my_data.shape[1]

    def __getitem__(self, i):
        # The whole CSV is re-parsed on every call.
        my_data = np.genfromtxt('/data/' + self.files, delimiter=',')
        # Return the i-th column as a (1, N) float tensor.
        tmp = my_data[:, i].reshape(1, -1)
        return torch.from_numpy(tmp).float()

    def __len__(self):
        return self.dim
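
For context, this is roughly how I drive it (simplified; the batch size and file names are placeholders, not my real settings):

from torch.utils.data import DataLoader

file_list = ['file_0001.csv', 'file_0002.csv']  # ~9000 of these in practice
for fname in file_list:
    loader = DataLoader(data_gen(fname), batch_size=32)
    for batch in loader:
        ...  # training step

Since __getitem__ re-reads the file, a file with k columns ends up being parsed k + 1 times per pass (once in __init__ plus once per sample).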

But this is terribly slow. I was wondering if I could store all of that data in one file, but I don't have enough RAM to load it all at once. So is there a way around it? I have sketched two half-formed ideas below.
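
The first idea attacks the re-parsing: parse each file at most once and serve columns from memory. This is only a rough sketch, untested at scale; MultiCSVDataset and cols_per_file are names I made up, and it assumes every file has the same number of columns. (np.genfromtxt is also one of the slower CSV parsers; pandas.read_csv followed by .to_numpy() is usually much faster if parsing itself is the bottleneck.)

import functools

import numpy as np
import torch

class MultiCSVDataset(torch.utils.data.Dataset):
    """One dataset spanning many CSV files; each file is parsed at
    most once while it stays in the cache, not once per sample."""

    def __init__(self, paths, cols_per_file):
        # cols_per_file: number of 1D series (columns) per file,
        # assumed identical across files for this sketch.
        self.paths = paths
        self.cols = cols_per_file

    @functools.lru_cache(maxsize=32)  # caps RAM at ~32 parsed files
    def _load(self, file_idx):
        return np.genfromtxt(self.paths[file_idx], delimiter=',')

    def __getitem__(self, i):
        # Map the flat sample index to (file, column).
        file_idx, col = divmod(i, self.cols)
        tmp = self._load(file_idx)[:, col].reshape(1, -1)
        return torch.from_numpy(tmp).float()

    def __len__(self):
        return len(self.paths) * self.cols

One caveat I can see: with shuffle=True the random access pattern would evict cache entries constantly, so this helps most with (near-)sequential sampling, or with shuffling done at the file level rather than the sample level.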

The second idea, the single-file route, is sketched next; let me know if there is a better way than either of these.
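
The plan here would be a one-off conversion of each CSV to NumPy's binary .npy format, then opening it with np.load(..., mmap_mode='r'), which memory-maps the file so only the pages actually touched are read from disk; the full 40 GB never has to fit in RAM at once. convert_csv and NpyColumnDataset are names invented for this sketch (mmap_mode is a real np.load option), and I have kept it per file rather than building one giant file.

import numpy as np
import torch

def convert_csv(csv_path, npy_path):
    # One-off conversion: parse the CSV a single time and store the
    # transpose, so each 1D series becomes one contiguous row.
    data = np.genfromtxt(csv_path, delimiter=',')
    np.save(npy_path, np.ascontiguousarray(data.T))

class NpyColumnDataset(torch.utils.data.Dataset):
    def __init__(self, npy_path):
        # mmap_mode='r' maps the file instead of loading it, so only
        # the rows actually indexed are read from disk.
        self.data = np.load(npy_path, mmap_mode='r')

    def __getitem__(self, i):
        # Copy one row out of the memmap into a real float32 array.
        row = np.array(self.data[i], dtype=np.float32).reshape(1, -1)
        return torch.from_numpy(row)

    def __len__(self):
        return self.data.shape[0]

Storing the transpose means each sample is one sequential read from disk (reading a column out of a row-major memmap would touch the whole file instead), and the per-file datasets can be chained with torch.utils.data.ConcatDataset.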

