Python: merging large datasets and how to work with large data (500 GB)

  large-data, python, python-3.x, python-sql

I have some large csv files that I need to merge together. Each file is about 5 GB and my RAM is only 8 GB. I use the following code to read the csv files into dataframes and merge them on the columns fund_ticker, TICKER, and Date.

import pandas as pd

# Read in data, ignoring the "Version" column
table1 = pd.read_csv(r'C:\data\data1.csv', usecols=lambda col: col not in ["Version"])
table2 = pd.read_csv(r'C:\data\data2.csv', usecols=lambda col: col not in ["Version"])
weight = pd.read_csv(r'C:\data\data3.csv', usecols=lambda col: col not in ["Version"])

print("Finish reading")

# Merge the datasets on the three key columns
merged = (table1
          .merge(table2, on=['fund_ticker', 'TICKER', 'Date'])
          .merge(weight, on=['fund_ticker', 'TICKER', 'Date']))

Unfortunately, I got the following error:

numpy.core._exceptions.MemoryError: Unable to allocate 105. MiB for an array with shape (27632931,) and data type object

After searching on the internet, I think the issue is that the data is larger than my RAM. To get around this, I am thinking about moving the data into a database such as SQL, or into parquet files. My question is: what is the most efficient way to work with datasets this large? My data is financial data and could grow to 500 GB or 1 TB. Some direction on how to set this up would be much appreciated.
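For the SQL route, I was picturing something roughly like the sketch below (the paths, table names, and chunk size are placeholders, and I have not tried this on the full data): stream each csv into a SQLite file in chunks so no full table ever sits in RAM, let the database do the three-way join, and read the result back in pieces.

import sqlite3
import pandas as pd

conn = sqlite3.connect(r'C:\data\funds.db')

# Stream each csv into its own table in chunks so nothing is fully in memory
for name, path in [('table1', r'C:\data\data1.csv'),
                   ('table2', r'C:\data\data2.csv'),
                   ('weight', r'C:\data\data3.csv')]:
    for chunk in pd.read_csv(path, usecols=lambda col: col != "Version",
                             chunksize=1_000_000):
        chunk.to_sql(name, conn, if_exists='append', index=False)

# Let SQLite do the three-way join on disk
query = """
    SELECT *
    FROM table1
    JOIN table2 USING (fund_ticker, TICKER, Date)
    JOIN weight USING (fund_ticker, TICKER, Date)
"""

# Pull the merged result back in pieces and append to one output file
first = True
for part in pd.read_sql_query(query, conn, chunksize=1_000_000):
    part.to_csv(r'C:\data\merged.csv', mode='a', header=first, index=False)
    first = False

conn.close()

Is something like this a sensible setup, or would parquet (or a proper database server) scale better once the data reaches hundreds of GB? Thanks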

