I am working with multiple CSV files, each containing several 1-D data series, one per column. I have about 9,000 such files, and the combined data is about 40 GB.
I have written a dataloader like this:
```python
import numpy as np
import torch

class data_gen(torch.utils.data.Dataset):
    def __init__(self, files):
        self.files = files
        # Read the file once up front just to record its shape.
        my_data = np.genfromtxt('/data/' + files, delimiter=',')
        self.dim = my_data.shape
        self.data = []

    def __getitem__(self, i):
        file1 = self.files
        # The entire CSV is re-read and re-parsed on every item lookup.
        my_data = np.genfromtxt('/data/' + file1, delimiter=',')
        self.dim = my_data.shape
        for j in range(my_data.shape[1]):
            tmp = np.reshape(my_data[:, j], (1, my_data.shape[0]))
            tmp = torch.from_numpy(tmp).float()
            # Note: self.data keeps growing across calls.
            self.data.append(tmp)
        return self.data[i]

    def __len__(self):
        # One sample per CSV column.
        return self.dim[1]
```
But this is terribly slow.
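I suspect the main cost is that `np.genfromtxt` re-parses the entire CSV on every `__getitem__` call. A variant that parses the file once in `__init__` and only slices columns afterwards (the `data_gen_cached` name and `_cache` attribute are just my own labels) would look roughly like this:

```python
import numpy as np
import torch

class data_gen_cached(torch.utils.data.Dataset):
    def __init__(self, files):
        self.files = files
        # Parse the CSV once up front and keep the array around.
        self._cache = np.genfromtxt('/data/' + files, delimiter=',')

    def __getitem__(self, i):
        # Pure in-memory slicing; no file I/O per item.
        col = self._cache[:, i].reshape(1, -1)
        return torch.from_numpy(col).float()

    def __len__(self):
        # One sample per CSV column.
        return self._cache.shape[1]
```

That removes the repeated parsing for a single file, but it still leaves the problem of handling all 9,000 files together.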
I was wondering if I could store all of that data in one file, but I don't have enough RAM to build it in memory. One direction I've been considering is sketched below; let me know if there's a way to make something like that work, or if there's a better approach.
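The idea would be `np.memmap`: write each CSV into a single on-disk float32 array one file at a time, so nothing close to 40 GB ever sits in RAM. Everything here (paths, dtype, and the assumption that every file has the same number of rows and more than one column) is a placeholder sketch, not tested code:

```python
import glob

import numpy as np

# Assumptions (placeholders): all CSVs live under /data/, every file has
# the same number of rows, and float32 precision is acceptable.
files = sorted(glob.glob('/data/*.csv'))

# One cheap sizing pass: rows from the first file, columns summed across
# all files (each CSV column = one sample).
n_rows = np.genfromtxt(files[0], delimiter=',').shape[0]
n_cols = sum(len(np.genfromtxt(f, delimiter=',', max_rows=1)) for f in files)

# Build one big on-disk array incrementally; only one CSV is in RAM at a time.
out = np.memmap('/data/all_data.dat', dtype='float32', mode='w+',
                shape=(n_cols, n_rows))
row = 0
for f in files:
    arr = np.genfromtxt(f, delimiter=',').astype('float32')
    out[row:row + arr.shape[1]] = arr.T  # transpose so columns become samples
    row += arr.shape[1]
out.flush()
```

A `Dataset` could then open the same file read-only and index it lazily, so a `DataLoader` only pulls the disk pages it actually touches:

```python
import numpy as np
import torch

class memmap_dataset(torch.utils.data.Dataset):
    def __init__(self, path, shape):
        # mode='r' maps the file read-only; nothing is loaded up front.
        self.data = np.memmap(path, dtype='float32', mode='r', shape=shape)

    def __getitem__(self, i):
        # Copy one sample out of the map so the tensor owns its memory.
        sample = np.array(self.data[i])
        return torch.from_numpy(sample).reshape(1, -1)

    def __len__(self):
        return self.data.shape[0]

# e.g. dataset = memmap_dataset('/data/all_data.dat', (n_cols, n_rows))
```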