Issue processing large JSON request into multiple files

  elasticsearch, python, python-3.x

I’m trying to make a request to Elasticsearch, get the result back as a JSON response, hash a number of fields within that response, and then break that JSON response, which is exceptionally large, into many files for ML processing. Normally I would just request, process, and ship to an API, but I am required to write the data to static files.

My attempts so far have led me to believe there are 2-3 potential solutions. One is to break the request itself into many pieces via timedelta and just make many smaller requests, but that would likely generate many files.
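For reference, the timedelta idea could be sketched as generating fixed time windows, each of which would become one smaller query (the field name @timestamp and the window size here are assumptions, not from my actual index):

    from datetime import datetime, timedelta

    def time_windows(start, end, step):
        """Yield (window_start, window_end) pairs covering [start, end)."""
        current = start
        while current < end:
            window_end = min(current + step, end)
            yield current, window_end
            current = window_end

    # Each window would become one Elasticsearch range filter, e.g.
    # {"range": {"@timestamp": {"gte": window_start.isoformat(),
    #                           "lt": window_end.isoformat()}}}
    windows = list(time_windows(datetime(2021, 1, 1),
                                datetime(2021, 1, 4),
                                timedelta(days=1)))
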

The second/third is to use what I was told is chunking, which I think would most likely be implemented either via the snippet I put together below (with open, files, and an iterating loop) or by using stream=True with requests, for which I found this regarding that solution:
How to get large size JSON files using requests module in Python
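As I understand it, the stream=True pattern just means writing the body to disk in fixed-size chunks instead of holding it all in memory. A minimal sketch, using an in-memory stand-in for the response body (with real requests, resp = requests.get(url, stream=True) and resp.iter_content(chunk_size=8192) would supply the chunks):

    import io

    def iter_chunks(body, chunk_size=8192):
        # Stand-in for requests' resp.iter_content(chunk_size=...):
        # yield the body in fixed-size pieces.
        while True:
            chunk = body.read(chunk_size)
            if not chunk:
                break
            yield chunk

    # Pretend this BytesIO is the HTTP response body.
    body = io.BytesIO(b'{"hits": {"hits": []}}' * 1000)

    with open("response.json", "wb") as f:
        for chunk in iter_chunks(body):
            f.write(chunk)  # only chunk_size bytes held in memory at once
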

I’m not familiar with using chunking with requests, so any advice on whether that is the best option would be great.

I’ve been working on this on and off for a while now as priorities have continued to shift, but I would love to close this item out so I can focus on my main projects, so any assistance anyone could provide would be great!

import hashlib
import json

my_dict = resp.json()

# Fields in each hit's _source that need to be hashed:
# account number, cardholder name, cardholder address,
# presentation instrument id, presentation instrument identifier.
FIELDS_TO_HASH = [
    "account_number",
    "cardHolderName",
    "cardHolderAddress",
    "presentation_instrument_id",
    "presentation_instrument_identifier",
]

listfile = "listfile0.txt"
with open(listfile, "w") as file:
    for hit in my_dict["hits"]["hits"]:
        source = hit["_source"]
        for field in FIELDS_TO_HASH:
            source[field] = hashlib.md5(source[field].encode()).hexdigest()
    json.dump(my_dict, file, indent=4)


################# Begin Snippet #2 #########################

# chunking learning attempt (breaking 1 list or large set of data into many files)

# define list of places
places = ['Berlin', 'Cape Town', 'Sydney', 'Moscow']
x = 0  # number of items written to the current file
y = 0  # index of the current file

filehandle = open("listfile{}.txt".format(y), 'w')
for listitem in places:
    filehandle.write('%s\n' % listitem)
    x += 1

    # once x reaches 3, close the current file, iterate y by 1 (which
    # changes the filename from listfile0 to listfile1, etc.), open the
    # new file, reset x to 0, and continue until all items are processed
    if x == 3:
        filehandle.close()
        y += 1
        filehandle = open("listfile{}.txt".format(y), 'w')
        x = 0
filehandle.close()
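Putting the two snippets together, the same counter pattern can split the hits list into JSON files with a fixed number of records each. A sketch, with made-up records standing in for the hashed my_dict["hits"]["hits"] from the first snippet:

    import json

    def write_chunks(records, chunk_size, prefix="chunk"):
        """Write records to prefix0.json, prefix1.json, ... with
        chunk_size records per file; return the filenames written."""
        filenames = []
        for i in range(0, len(records), chunk_size):
            filename = "{}{}.json".format(prefix, i // chunk_size)
            with open(filename, "w") as f:
                json.dump(records[i:i + chunk_size], f, indent=4)
            filenames.append(filename)
        return filenames

    # Stand-ins for the hashed hits from the first snippet.
    hits = [{"_id": n} for n in range(7)]
    files = write_chunks(hits, chunk_size=3)
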

Source: Python Questions
