Create sklearn pipeline with column operations step

  dataframe, pandas, pipeline, python, scikit-learn

How can I add a step to a sklearn pipeline that multiplies the values of two columns and drops the original ones?

This is what I'm doing at the moment:

  • After loading the DataFrame, I multiply the target columns and drop the originals.
  • Prepare X, Y, the training set and the test set.
  • Configure a pipeline with StandardScaler and some ML method (for example, LinearRegression).
  • Fit and predict.

import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# df is a pandas DataFrame with columns A, B, C, Y
# Replace B and C with their product
df['BC'] = df['B'] * df['C']
df.drop(columns=['B', 'C'], inplace=True)

X = df.loc[:, ['A', 'BC']]
Y = df['Y']

x_train, x_test, y_train, y_test = train_test_split(X, Y, train_size=0.8)

pipe = Pipeline([
    ('minmax', StandardScaler()),
    ('linear', LinearRegression())
])

pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)

With this approach, when I want to predict on new data, I have to compute the product myself and pass it in. For example, for A=1, B=3, C=4:

print(pipe.predict(np.array([[1,12]])))

What I want instead is to pass the raw values:

print(pipe.predict(np.array([[1,3,4]])))

That is, I'd like to change the pipeline to something like this:

pipe = Pipeline([
    ('product', CustomFunction(columns_to_multiply, result_name_column)),
    ('minmax',StandardScaler()),
    ('linear',LinearRegression())
])

Is this possible with scikit-learn, or with a custom function? How?
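
For reference, one way the step I have in mind might look is a small custom transformer built on scikit-learn's BaseEstimator and TransformerMixin. This is only a minimal sketch, not a confirmed solution: the class name ColumnProduct is hypothetical, it takes positional column indices rather than names (so that plain NumPy input like [[1,3,4]] works), it appends the product as the last column, and the toy data below is invented purely for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnProduct(BaseEstimator, TransformerMixin):
    """Hypothetical transformer: replace the given columns with their product.

    `columns` holds positional indices, so plain NumPy input works.
    """

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; the step is stateless.
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        product = X[:, self.columns].prod(axis=1, keepdims=True)
        rest = np.delete(X, self.columns, axis=1)
        # Remaining columns first, product last: [A, B, C] -> [A, B*C]
        return np.hstack([rest, product])


# Toy data (assumption, for illustration only):
# columns A, B, C and a target Y = 2*A + B*C.
rng = np.random.default_rng(0)
X = rng.uniform(1, 5, size=(50, 3))
Y = 2 * X[:, 0] + X[:, 1] * X[:, 2]

pipe = Pipeline([
    ('product', ColumnProduct(columns=[1, 2])),
    ('minmax', StandardScaler()),
    ('linear', LinearRegression()),
])
pipe.fit(X, Y)

# Raw A, B, C values go in; the product is computed inside the pipeline.
print(pipe.predict(np.array([[1, 3, 4]])))
```

Because the toy target is exactly linear in A and B*C, the printed prediction should come out very close to 14. A DataFrame whose columns are ordered A, B, C should also work here, since the transformer converts its input to a NumPy array first.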

Source: Python Questions
