What is the correct way to calibrate probabilities when you have a multiclass problem?

  calibration, grid-search, python, scikit-learn, scipy

I am training a model to predict the label (target) based on loan status, e.g. 0, 1, 2, 3, so I have 4 classes. I have so far trained a model as follows:

  from HyperclassifierSearch import HyperclassifierSearch

X = data.iloc[:, :-1]
y = data.label

X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.2,
                                                    random_state=42)
# Create a hold-out dataset to train the calibrated model to prevent overfitting
X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train,
                                                                stratify=y_train, test_size=0.2,
                                                                random_state=42)

categorical_transformer = OneHotEncoder(handle_unknown='ignore')
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(missing_values=np.nan, fill_value=0)),
    ('scaler', StandardScaler())])
preprocessor = ColumnTransformer(transformers=[
    ('num', numeric_transformer, numeric_cols),
    ('cat', categorical_transformer, cat_cols)])

# then I use the HyperclassifierSearch library

models = {
    'xgb': Pipeline(steps=[('preprocessor', preprocessor),
                           ('clf', XGBClassifier(objective='multi:softprob'))]),
    'rf': Pipeline(steps=[('preprocessor', preprocessor),
                          ('clf', RandomForestClassifier(criterion='entropy', random_state=42))])
}

search = HyperclassifierSearch(models, params)
best_grid = search.train_model(X_train, y_train, cv=3, n_jobs=-1, scoring='accuracy')
results = search.evaluate_model()
fitted_model = best_grid.best_estimator_

pred = fitted_model.predict_proba(X_test)
labels = fitted_model.predict(X_test)

**Note: I have omitted most of the imports and the `params` dict since they are large; I only included the HyperclassifierSearch import since it is a less well-known library.**

My `pred` is a matrix with 4 columns, one per loan class. I know it is generally good practice to calibrate the probabilities, and in particular that the output of tree-based algorithms is a score rather than a true probability. I am, however, confused as to how to calibrate these probabilities.
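To illustrate what I mean by checking calibration, this is roughly how I inspect one class at a time (one-vs-rest), here with synthetic stand-ins for my `y_test` and `pred`:

```python
import numpy as np
from sklearn.calibration import calibration_curve

rng = np.random.RandomState(0)
# Stand-ins for y_test and the (n_samples, 4) predict_proba output
y_test = rng.randint(0, 4, size=500)
pred = rng.dirichlet(np.ones(4), size=500)  # each row sums to 1

# Reliability curve for class 0, treated one-vs-rest:
# bin the predicted probabilities and compare to observed frequencies
prob_true, prob_pred = calibration_curve((y_test == 0).astype(int),
                                         pred[:, 0], n_bins=5)
print(prob_true)
print(prob_pred)
```

For a well-calibrated class, `prob_true` should track `prob_pred` closely.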

Usually I would calibrate using the hold-out validation set, but I am unsure how to do it with multiclass.
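From the docs I gather that `CalibratedClassifierCV` accepts multiclass targets directly (it fits one calibrator per class in a one-vs-rest fashion and renormalises the probabilities), so one option seems to be something like this sketch, with synthetic data standing in for my loan data:

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(42)
X = rng.rand(1000, 5)             # stand-in for the loan features
y = rng.randint(0, 4, size=1000)  # 4 loan-status classes

X_train, X_val, y_train, y_val = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=42)

# cv=3 cross-fits the base model and the calibrator on the training data;
# method='sigmoid' is Platt scaling (isotonic needs more data per class)
calibrated = CalibratedClassifierCV(
    RandomForestClassifier(random_state=42), method='sigmoid', cv=3)
calibrated.fit(X_train, y_train)

probs = calibrated.predict_proba(X_val)
print(probs.shape)        # (200, 4): one column per class
print(probs.sum(axis=1))  # each row sums to 1
```

But I am not sure whether that is preferable to calibrating on my separate `X_validation` hold-out.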


Should I amend the above XGBClassifier by doing the following:

OneVsRestClassifier(CalibratedClassifierCV(XGBClassifier(objective='multi:softprob'), cv=10))

source: Multiclass linear SVM in python that return probability
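As far as I can tell, that pattern fits one calibrated binary classifier per class and `OneVsRestClassifier` then renormalises across classes. A minimal sketch of it (with `RandomForestClassifier` substituted for the XGB pipeline so it runs without xgboost):

```python
import numpy as np
from sklearn.calibration import CalibratedClassifierCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.RandomState(0)
X = rng.rand(600, 5)             # stand-in features
y = rng.randint(0, 4, size=600)  # 4 classes

# One calibrated binary classifier per class; OneVsRestClassifier
# divides each row by its sum so the 4 columns form a distribution
ovr = OneVsRestClassifier(
    CalibratedClassifierCV(RandomForestClassifier(random_state=0), cv=3))
ovr.fit(X, y)

probs = ovr.predict_proba(X)
print(probs.shape)  # (600, 4)
```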

My question is: what is the correct way to calibrate probabilities from a multiclass model?

Source: Python Questions