BERT Input for next sentence prediction (python NLP)

I am new to NLP and to coding in Python. I am trying to train a BERT model to predict the correct next utterance. I am given a disentangled conversation and need to select the next utterance from a pool of 100 candidates, which might not contain the correct one. The training data comes in this format:

{
    "data-split": "train",
    "example-id": 0,
    "messages-so-far": [
        {
            "date": "2007-02-13",
            "speaker": "participant_0",
            "time": "07:31",
            "utterance": "hi guys, i need some urgent help. i \"rm -rf'd\" a direcotry. any way i can recover it?"
        },
        {
            "date": "2007-02-13",
            "speaker": "participant_1",
            "time": "07:31",
            "utterance": "participant_0 : in short, no."
        },
        {
            "date": "2007-02-13",
            "speaker": "participant_0",
            "time": "07:31",
            "utterance": "participant_1 , are you sure?"
        },
        ...
    ],
    "options-for-correct-answers": [
        {
            "candidate-id": "3d06877cb2f0c1861b248860fa60ce07",
            "speaker": "participant_1",
            "utterance": "\"Are you sure?\" is something rm -rf never asks.."
        }
    ],
    "options-for-next": [
        {
            "candidate-id": "ace962b708d559fc462b7fdd9b6fc093",
            "speaker": "participant_1",
            "utterance": "(and if hardware is detected correctly, of course)"
        },
        {
            "candidate-id": "349efca9c3d5986a87d95fb90c1b7c04",
            "speaker": "participant_2",
            "utterance": "how do i do a simulated reboot"
        },
        ...
    ],
    "scenario": 1
}

The field messages-so-far contains the context of the dialog and options-for-next contains the candidates to select the next utterance from. The correct next utterance is given in the field options-for-correct-answers. The field scenario refers to the subtask.
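
To make the relationship between these fields concrete, here is a minimal sketch of how one record could be flattened into (context, candidate, label) rows, with label 1 when a candidate's id also appears in options-for-correct-answers. The function name `record_to_pairs`, the choice to use only the last message as context, the 0/1 labels, and the toy record with ids `id-a` and `id-b` are all assumptions for illustration; nothing in the dataset prescribes this layout.

```python
def record_to_pairs(record):
    """Flatten one record into (context, candidate, label) tuples."""
    # Assumption: only the most recent utterance is used as context.
    context = record["messages-so-far"][-1]["utterance"]
    # Ids of the correct answers; this set is empty when the pool has none.
    correct_ids = {c["candidate-id"] for c in record["options-for-correct-answers"]}
    pairs = []
    for cand in record["options-for-next"]:
        label = 1 if cand["candidate-id"] in correct_ids else 0
        pairs.append((context, cand["utterance"], label))
    return pairs


# Toy record with made-up ids, shaped like the sample above.
example = {
    "messages-so-far": [
        {"speaker": "participant_0", "utterance": "participant_1 , are you sure?"},
    ],
    "options-for-correct-answers": [
        {"candidate-id": "id-b", "utterance": "yes, quite sure"},
    ],
    "options-for-next": [
        {"candidate-id": "id-a", "utterance": "how do i do a simulated reboot"},
        {"candidate-id": "id-b", "utterance": "yes, quite sure"},
    ],
}

for row in record_to_pairs(example):
    print(row)
```

With this layout, a record whose options-for-correct-answers list is empty simply yields rows that are all labeled 0.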

What format should I convert this data into? It is currently JSON. I know it needs to be a TSV file, but I am having a hard time figuring out what the columns should be.

I wrote code that puts it into a table with the columns example_id, last_sentence, and next_sentence, but I don't think this is what I want. Any help is appreciated.

For reference, this is the code that processes it into that format. Any suggestions on how to change it so that I can write a TSV file suitable for BERT training would be awesome!

import json

import pandas as pd

file_path = "/Users/madison/Desktop/Final 1671/NOESIS-II/subtask1/data/task-1.advising.train.json"

# Load the full list of training records.
with open(file_path) as json_file:
    records = json.load(json_file)

example_id = []
last_sentence = []
next_sentence = []

for row in records:
    example_id.append(row['example-id'])
    # Use only the most recent message as the context.
    last_sentence.append(row['messages-so-far'][-1]['utterance'])

    # Some examples have no correct answer in the candidate pool.
    if len(row['options-for-correct-answers']) != 0:
        next_sentence.append(row['options-for-correct-answers'][0]['utterance'])
    else:
        next_sentence.append("None")

data = {"example_id": example_id, "last_sentence": last_sentence, "next_sentence": next_sentence}
df = pd.DataFrame(data)

print(df.head())
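
Writing the rows out as tab-separated text is the mechanical part. Below is a minimal sketch using only the standard library; the helper name `write_tsv`, the column names, and the output file `pairs.tsv` in the temp directory are hypothetical, and which columns a given BERT training script expects is still an open question here.

```python
import csv
import os
import tempfile


def write_tsv(rows, path, header):
    """Write an iterable of row tuples to a tab-separated file."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        # The csv module quotes any field that itself contains a tab.
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)


# Made-up rows in a (context, candidate, label) layout.
rows = [
    ("participant_1 , are you sure?", "how do i do a simulated reboot", 0),
    ("participant_1 , are you sure?", "yes, quite sure", 1),
]
out_path = os.path.join(tempfile.gettempdir(), "pairs.tsv")
write_tsv(rows, out_path, header=("context", "candidate", "label"))
```

For a pandas DataFrame such as `df`, `df.to_csv(out_path, sep="\t", index=False)` produces the same kind of file.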

Source: Python Questions