Fitting a Gaussian Process model to PCA. Predictions look very wrong

  model, pca, python, regression, scikit-learn

I’m currently trying to fit a Gaussian Process model to my data and have it predict some days ahead. I have reduced my ~10 features down to just 2 components via PCA in sklearn, so now I have PCA1 and PCA2. These were obtained by performing PCA on the training set (40% of the data).

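from sklearn.decomposition import PCA

# Fit PCA on the training data, then project it onto 2 components
pca = PCA(n_components=2)
pca.fit(train_data)
PCAs = pca.transform(train_data)
PCA1 = PCAs[:, 0]  # first principal component
PCA2 = PCAs[:, 1]  # second principal component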

where train_data is the normalized dataframe with ~10 features and 50 rows.

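from sklearn import gaussian_process
from sklearn.gaussian_process.kernels import RBF

kernel = RBF()
model = gaussian_process.GaussianProcessRegressor(kernel=kernel, normalize_y=True, n_restarts_optimizer=10)
model.fit(x_days_train, PCA1)
y_pred, y_std = model.predict(x_days, return_std=True)
model.score(x_days_train, PCA1)  # R^2 evaluated on the training days only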

where x_days is the full 50 days, and x_days_train is 20 days (0, 1, 2, …). I get a score of 1.0. However, my predicted results look terrible (as per below). After the training data, the prediction just falls and then stagnates.

enter image description here
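
For anyone who wants to reproduce the setup, here is a minimal self-contained sketch of the same pipeline. The synthetic data, the `StandardScaler` step, and the `random_state` values are assumptions standing in for my real data and normalization; only the PCA and Gaussian Process calls mirror the code above.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for my data (assumption): 50 days x 10 noisy features
days = np.arange(50)
raw = np.column_stack(
    [np.sin(days / (3.0 + k)) + 0.1 * rng.standard_normal(50) for k in range(10)]
)

x_days = days.reshape(-1, 1)   # sklearn expects 2-D inputs
x_days_train = x_days[:20]     # first 20 days (40%)

# Normalize and fit PCA using the training rows only
scaler = StandardScaler().fit(raw[:20])
train_data = scaler.transform(raw[:20])
pca = PCA(n_components=2).fit(train_data)
PCA1 = pca.transform(train_data)[:, 0]

model = GaussianProcessRegressor(
    kernel=RBF(), normalize_y=True, n_restarts_optimizer=10, random_state=0
)
model.fit(x_days_train, PCA1)
y_pred, y_std = model.predict(x_days, return_std=True)

# R^2 on the same 20 days the model was fitted on
train_score = model.score(x_days_train, PCA1)
print(train_score)
print(y_pred[20:])   # predictions for the 30 days the model never saw
```

Note that the score here is computed on the training days only, so it says nothing about how good the predictions are for the remaining 30 days.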

Not entirely sure what went wrong, but a couple guesses:

  • Since my data has no target variable, I ran PCA on all the features in the dataframe, which are supposed to be x variables, and then used the result as the y variable (the thing being predicted). Maybe this is an incorrect approach?
  • Following that, can a PCA component even be used as the y to predict?
  • Am I supposed to apply PCA to not just the training data, but also to the test data (apply fit_transform)?
  • I seem to be only using PCA1 and not PCA2 (nor a combination of the two). Should I use both? If so, how?
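
To make the last two points concrete, this is a rough sketch of what I think they would look like in code; I am not sure it is the right approach. It fits PCA on the training rows only, reuses that fitted projection on the remaining rows with `transform` (rather than `fit_transform`), and passes both components to the regressor as a two-column y. The random `all_data` array, `n_train`, and `random_state` are hypothetical placeholders, not my real data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

# Hypothetical already-normalized data: 50 days x 10 features
all_data = rng.standard_normal((50, 10))
x_days = np.arange(50).reshape(-1, 1)
n_train = 20

pca = PCA(n_components=2)
train_pcs = pca.fit_transform(all_data[:n_train])  # fit on training rows only
test_pcs = pca.transform(all_data[n_train:])       # same projection, no refit

model = GaussianProcessRegressor(
    kernel=RBF(), normalize_y=True, n_restarts_optimizer=10, random_state=0
)
model.fit(x_days[:n_train], train_pcs)  # y has shape (20, 2): PCA1 and PCA2
y_pred = model.predict(x_days)          # shape (50, 2)

# Error on the held-out days, one value per component
test_rmse = np.sqrt(np.mean((y_pred[n_train:] - test_pcs) ** 2, axis=0))
print(test_rmse)
```

Comparing `y_pred[n_train:]` with `test_pcs` would at least give an error measure on days the model has not seen, instead of the training score.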

Would appreciate any help, thank you.

Source: Python Questions