Create sklearn pipeline with column operations step

  dataframe, pandas, pipeline, python, scikit-learn

I would like to know how to add a step to a sklearn pipeline that multiplies the values of two columns and deletes the original ones.

Currently I'm doing something like this:

  • After loading the Dataframe, I multiply the target columns and delete them.
  • Prepare X, Y, training set and test set.
  • Configure pipeline with StandardScaler and some ML method (for example Linear Regression)
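  • Fit and predict.

import pandas as pd, numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# df is a pandas dataframe with columns A, B, C, Y
# Multiply the target columns, then delete the originals
df['BC'] = df['B'] * df['C']
df.drop(columns=['B','C'], inplace=True)

X = df.loc[:,['A','BC']]
Y = df['Y']

x_train, x_test, y_train, y_test = train_test_split(X,Y,train_size=0.8)

# Scale the features, then fit a linear regression
pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(x_train, y_train)
y_pred = pipe.predict(x_test)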

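With this approach, when I want to predict on new data I have to compute the product myself first. For example, for A=1, B=3, C=4 I must pass A=1 and BC=12.

Instead, I would like to pass the raw values A=1, B=3, C=4 directly and have the pipeline compute the product.

What I want is to modify the pipeline to something like: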

pipe = Pipeline([
    ('product', CustomFunction(columns_to_multiply, result_name_column)),
    # ... followed by the StandardScaler and regression steps
])

Is it possible with scikit-learn or custom functions? How?
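
One way this could be approached is with a small custom transformer that follows scikit-learn's `fit`/`transform` convention. The following is only a minimal sketch: the class name `ColumnProduct`, its parameters, the step names and the toy data are all assumptions made for illustration, and it assumes the pipeline is fed pandas DataFrames rather than NumPy arrays.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnProduct(TransformerMixin, BaseEstimator):
    """Hypothetical transformer: replace several columns with their product."""

    def __init__(self, columns, result_name):
        self.columns = columns
        self.result_name = result_name

    def fit(self, X, y=None):
        # Nothing is learned from the data; the step is stateless.
        return self

    def transform(self, X):
        # Work on a copy so the caller's DataFrame is left untouched.
        X = X.copy()
        X[self.result_name] = X[self.columns].prod(axis=1)
        return X.drop(columns=self.columns)


# Toy data, invented for illustration only.
df = pd.DataFrame({
    'A': [1.0, 2.0, 3.0, 4.0, 5.0],
    'B': [3.0, 1.0, 2.0, 5.0, 4.0],
    'C': [4.0, 2.0, 1.0, 3.0, 5.0],
})
df['Y'] = 2 * df['A'] + df['B'] * df['C']

pipe = Pipeline([
    ('product', ColumnProduct(columns=['B', 'C'], result_name='BC')),
    ('scaler', StandardScaler()),
    ('model', LinearRegression()),
])
pipe.fit(df[['A', 'B', 'C']], df['Y'])

# New data is passed with the raw columns; the pipeline multiplies B and C itself.
new_data = pd.DataFrame({'A': [1], 'B': [3], 'C': [4]})
y_pred = pipe.predict(new_data)
```

With this layout the multiplication lives inside the pipeline, so `fit` and `predict` both receive the raw columns A, B, C. For a stateless step like this, `sklearn.preprocessing.FunctionTransformer` wrapping a plain function would likely work as well; the class form is shown here because it mirrors the `CustomFunction(columns_to_multiply, result_name_column)` shape from the question.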

Source: Python Questions