XGBoost: How to get feature names of an encoded dataframe for a feature importance plot?

  python, xgboost

We are using xgboost to make some predictions. We do some pre-processing and hyper-parameter tuning before fitting the model. While performing model diagnostics, we'd like to plot the feature importances with their feature names.

Here are the steps we’ve taken.

# split df into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(df.iloc[:, 0:21], df.iloc[:, -1], test_size=0.2)

X_train.shape
(1671, 21)

# Encoding of categorical variables
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

cat_vars = ['cat1', 'cat2']
cat_transform = ColumnTransformer([('cat', OneHotEncoder(handle_unknown='ignore'), cat_vars)], remainder='passthrough')

# fit the encoder on the training set, then transform both splits
encoder = cat_transform.fit(X_train)
X_train = encoder.transform(X_train)
X_test = encoder.transform(X_test)

X_train.shape
(1671, 420)
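Note that after the transform X_train is no longer a pandas DataFrame; depending on the encoder settings it comes back as a NumPy array or a SciPy sparse matrix (an assumption about our setup, but easy to check), so there are no column names attached at all:

type(X_train)
# e.g. <class 'scipy.sparse.csr_matrix'> or <class 'numpy.ndarray'> -- no column names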

# Define xgb object
from xgboost import XGBRegressor

model = XGBRegressor()
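params is a dictionary of XGBoost hyper-parameter ranges; the exact values are not important for this question, but for completeness it looks roughly like this (illustrative values only):

# Illustrative hyper-parameter search space (actual values omitted)
params = {
    'n_estimators': [100, 300, 500],
    'max_depth': [3, 5, 7],
    'learning_rate': [0.01, 0.05, 0.1],
    'subsample': [0.7, 0.9, 1.0],
}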

# Tune hyper-parameters
from sklearn.model_selection import RandomizedSearchCV

r = RandomizedSearchCV(model, param_distributions=params, n_iter=200, cv=3, verbose=1, n_jobs=1)

# Fit model
r.fit(X_train, y_train)

# Best model from the search
xgb = r.best_estimator_
xgb

# Plot feature importance
import matplotlib.pyplot as plt

plt.barh(feature_names, xgb.feature_importances_)   # feature_names = ??? -- this is the part we are missing

After encoding, X_train no longer carries any usable feature names, and we cannot simply reuse the column names of the original dataframe because of the shape mismatch (21 original columns vs. 420 encoded columns). How can we get the feature names that correspond to the encoded columns so we can label the importance plot?
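What we are hoping for is something along these lines (a sketch only; it assumes a recent scikit-learn where the fitted ColumnTransformer exposes get_feature_names_out, which we have not verified for our setup with remainder='passthrough'):

# Assumption: scikit-learn >= 1.0, where ColumnTransformer provides get_feature_names_out()
feature_names = encoder.get_feature_names_out()    # names for all 420 encoded/passthrough columns
plt.barh(feature_names, xgb.feature_importances_)
plt.show()

Is this the right way to recover the names, or is there a better approach?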

