Read files from Azure Blob Storage into Pandas DataFrame

I have ~100 GB of data in Azure Blob Storage that I want to analyze in a Pandas DataFrame inside a Jupyter Notebook (running in AWS SageMaker). To do that, I need to fetch all of the individual files (< 1 million of them) from Azure and combine them into one large DataFrame.

I have a working solution, but the performance is very poor, even on a subset of the data. Ideally, I would also like to avoid duplicating the data in both Azure and AWS, even though I did find this post by Azure on the topic. Is there a more efficient way to perform this task?

%%timeit -n1 -r1
import pandas as pd
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(CONTAINERNAME)

df = pd.DataFrame()
blob_counter = 0

# Iterate through list of relevant blobs
for blob in container_client.list_blobs(name_starts_with="2002/09/"):
    
    blob_client = blob_service_client.get_blob_client(CONTAINERNAME, blob)
    downloaded_blob = blob_client.download_blob()

    # Read downloaded content into a Pandas DataFrame
    df = df.append(pd.read_json(downloaded_blob.content_as_text()))
    
    blob_counter += 1
        
print(f'Found {blob_counter} blobs')

df
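
For reference, here is a minimal sketch of the kind of change I assume would help: collecting the per-blob frames in a list and calling pd.concat once at the end, with the downloads run in a thread pool instead of one at a time. It is untested, uses the same CONNECTION_STRING / CONTAINERNAME placeholders as above, and the worker count of 16 is just a guess to tune. Would this be the right direction, or is there a better approach that avoids downloading each file individually?

import io
from concurrent.futures import ThreadPoolExecutor

import pandas as pd
from azure.storage.blob import BlobServiceClient

blob_service_client = BlobServiceClient.from_connection_string(CONNECTION_STRING)
container_client = blob_service_client.get_container_client(CONTAINERNAME)

def read_blob(blob_name):
    # Download one blob and parse its JSON content into a DataFrame
    data = container_client.download_blob(blob_name).readall()
    return pd.read_json(io.BytesIO(data))

# Collect the names of the relevant blobs up front
blob_names = [b.name for b in container_client.list_blobs(name_starts_with="2002/09/")]

# Download and parse the blobs concurrently (worker count is a guess)
with ThreadPoolExecutor(max_workers=16) as executor:
    frames = list(executor.map(read_blob, blob_names))

# Concatenate once instead of appending inside the loop
df = pd.concat(frames, ignore_index=True)

print(f'Found {len(blob_names)} blobs')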

Thanks!
