XGBoost fails to fit data

  python-3.x, scikit-learn, xgbclassifier, xgboost

I have two separate working models. They are identical except that one uses Random Forest and the other uses XGBoost.

Yesterday I made changes to the data (I added two columns) and trained the RF model. It now scores about 4% higher than before I added the two columns.

So today I commented out the RF model and plugged in the XGBoost model (from the existing working model).

When I try to run the XGBoost model I am now getting:

ValueError: DataFrame.dtypes for data must be int, float, bool or categorical.  When
                categorical type is supplied, DMatrix parameter
                `enable_categorical` must be set to `True`.

I am NOT passing unencoded categorical data to XGBoost.

The data features:

construction_year       int32   <--  Added yesterday
amount_tsh            float64   <--  Added yesterday
basin                category
region_code          category
lga                  category
extraction_type      category
management           category
payment              category
quality_group        category
quantity             category
source               category
waterpoint_type      category
cluster              category
temp                   object     <--- HOLD ON. See below.
dtype: object 
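For what it's worth, here is the kind of check I can run to confirm which columns XGBoost would actually object to. The frame below is a hypothetical stand-in for my real encoded df (which I can't paste in full):

```python
import pandas as pd

# Hypothetical stand-in for the encoded frame: two numeric columns plus
# the leftover 'temp' flag column, mirroring the listing above.
dfx = pd.DataFrame({
    0: [0.0, 1.0],
    1: [1999, 2010],
    2: ['train', 'test'],  # the 'temp' flag column
})

# Any columns that would trip XGBoost's dtype check show up here.
bad = dfx.select_dtypes(include=['object', 'category'])
print(bad.columns.tolist())  # -> [2]
```

In my real data, the only column this reports is the 'temp' flag, which I remove before training.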

The ‘temp’ column exists only so I can split the encoded df back into separate train and test sets before I train the model. It gets removed right after the split.

# The encoded frame's columns are integer-labelled; the last column
# (label temp_df.shape[1] - 1) holds the 'train'/'test' flag.
X = dfx[dfx[temp_df.shape[1] - 1] == 'train']
X2 = dfx[dfx[temp_df.shape[1] - 1] == 'test']
print(X.head())
print(X2.head())
# Drop the flag column now that the split is done.
X = X.iloc[:, :-1]
X2 = X2.iloc[:, :-1]
print(X.head())
print(X2.head())
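To rule out a dtype surprise introduced by the split itself, here is a minimal sketch of the same flow on a toy frame, with a dtype check after the slice and a defensive numeric coercion (the toy frame and its column values are assumptions, not my real data):

```python
import pandas as pd

# Toy stand-in for dfx: integer-labelled columns, last column is the flag.
dfx = pd.DataFrame({
    0: [1.0, 0.0, 0.0],
    1: [1999, 2010, 2012],
    2: ['train', 'train', 'test'],
})
flag = dfx.shape[1] - 1

# Same pattern as above: select by the flag, then drop the flag column.
X = dfx[dfx[flag] == 'train'].iloc[:, :-1]

# If pandas ever up-casts the frame to a single object block, the sliced
# columns can come back as object even though the values look numeric.
print(X.dtypes.value_counts())

# Defensive fix: force everything to a numeric dtype before fitting.
X = X.apply(pd.to_numeric)
print(X.dtypes.value_counts())
```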

And the proof:

    0     1     2     3     4     5     6     7     8     9     10    11   ...   224   225   226   227   228   229   230   231   232   233      234    235
0 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  1999 1200.000  train
1 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  2010    0.000  train
2 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2009   25.000  train
3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  1986    0.000  train
4 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2006    0.000  train
        0     1     2     3     4     5     6     7     8     9     10    11   ...   224   225   226   227   228   229   230   231   232   233     234   235
59400 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2012   0.000  test
59401 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2000   0.000  test
59402 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2010   0.000  test
59403 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  1987   0.000  test
59404 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  2000 500.000  test
    0     1     2     3     4     5     6     7     8     9     10    11   ...   223   224   225   226   227   228   229   230   231   232   233      234
0 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  1999 1200.000
1 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  2010    0.000
2 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2009   25.000
3 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  1986    0.000
4 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2006    0.000
        0     1     2     3     4     5     6     7     8     9     10    11   ...   223   224   225   226   227   228   229   230   231   232   233     234
59400 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2012   0.000
59401 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000 1.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2000   0.000
59402 1.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000  2010   0.000
59403 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  1987   0.000
59404 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.000 0.000 0.000  ... 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 1.000  2000 500.000

The last column containing ‘train’ and ‘test’ is gone.

I opened the df in PyCharm. NOTHING is unencoded except for the two new numeric columns (233 and 234).

Also, the shape of the combined df and the shapes of the separated train and test dfs confirm that the last column has been removed:

(74250, 236)
(59400, 235)
(14850, 235) 

I am at a loss as to why XGBoost thinks I am passing it unencoded categorical data. As I stated, this all worked until I added the two new numerical columns yesterday, and the RF model works with the two new columns.
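In case it helps anyone spot the issue: my understanding is that if even one column keeps a `category` (or `object`) dtype after encoding, even when its values are numeric, XGBoost's DMatrix rejects the whole frame unless `enable_categorical=True` is set. A hedged sketch of the workaround I would expect to apply (the frame and column names here are made up for illustration):

```python
import pandas as pd

# Hypothetical encoded features: the values are numeric, but one column
# carries a leftover `category` dtype, which DMatrix rejects by default.
X = pd.DataFrame({
    'a': [0.0, 1.0, 0.0],
    'b': pd.Series([1, 0, 1], dtype='category'),
})

print(X.dtypes.tolist())  # one column is still categorical

# Cast every column to a plain numeric dtype before fitting.
X_num = X.astype('float64')
print(X_num.dtypes.unique().tolist())  # -> [dtype('float64')]
```

The alternative would be leaving the dtype as-is and constructing the model with categorical support enabled (in recent XGBoost versions, `XGBClassifier(enable_categorical=True, tree_method='hist')`), but casting to float matches what my encoding is supposed to produce anyway.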
