Ways to handle negative predictions in a regression model

  data-modeling, predict, python, regression, tf-idf

I wanted to predict income using text data (the description) as a predictor. This is what my dataframe looks like:

                                          c_description  d_worldwide_gross_income
641  fierce roman commander marcus vinicius become ...            1034933.275020
645  melancholy poet reflect three woman love lose ...            1089736.217494
644  disturb blanche dubois move sister new orleans...             505025.329393
643  lonely woman recall first love thirteen year p...              73424.113475
642  three adolescent girl grow bengal india learn ...             544123.669819

Here is the modelling code:

def model():
    import numpy as np
    import pandas as pd
    from IPython.display import display  # notebook helper for rendering DataFrames
    from sklearn import metrics
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import Ridge
    from sklearn.model_selection import train_test_split

    # Turn each description into a sparse TF-IDF vector
    vectorizer = TfidfVectorizer()
    x = vectorizer.fit_transform(model_df['c_description'])
    vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
    y = model_df['d_worldwide_gross_income']

    # Hold out 30% of the rows for evaluation
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
    clf = Ridge()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(pred)

    # Compare actual and predicted income side by side
    pred_df = pd.DataFrame({'Actual': y_test, 'Predicted': pred})
    display(pred_df)

    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))

The thing is, I get a negative prediction (starred in the output), which does not make sense (gross income can’t be negative). I assume this is common when you predict a continuous variable from vectorized text data (a sparse matrix) that contains lots of zeros. But is there a way to handle this issue?

Output:

           Actual       Predicted
14678  6833413.127504  2849365.333598
12631 15076388.644552  7301462.466993
16131  1512745.545534  3046698.088006
4406     25325.846617 **-1436044.714117**
21199   124397.540278  5321914.505052
Mean Absolute Error: 4102039.343052313
Mean Squared Error: 35381871200690.305
Root Mean Squared Error: 5948266.234852834
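
One possible way to keep predictions like the starred one from going negative is to fit the model on log-transformed income and map the predictions back. The sketch below is not part of the original code: it assumes income is never negative, uses scikit-learn's TransformedTargetRegressor with np.log1p / np.expm1, and runs on a tiny made-up corpus (the descriptions and income figures are hypothetical stand-ins for model_df).

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline

# Hypothetical stand-in data, NOT the real dataset from the question
descriptions = [
    "roman commander falls in love",
    "poet reflects on lost love",
    "woman moves in with her sister",
    "three girls grow up in india",
    "commander leads army to victory",
    "lonely woman recalls first love",
]
income = np.array([1.0e6, 1.1e6, 5.0e5, 5.4e5, 2.0e6, 7.3e4])

# Ridge is trained on log1p(income); predictions are mapped back with expm1
model = make_pipeline(
    TfidfVectorizer(),
    TransformedTargetRegressor(
        regressor=Ridge(),
        func=np.log1p,
        inverse_func=np.expm1,
    ),
)
model.fit(descriptions, income)

new_descriptions = ["commander falls in love with a poet", "sister moves to india"]
raw_pred = model.predict(new_descriptions)

# expm1 of any real number is greater than -1, so large negative values are
# impossible; clipping at 0 removes the remaining sliver below zero.
pred = np.clip(raw_pred, 0, None)
print(pred)
```

A side effect of this choice is that the model minimizes error on the log scale, so very large incomes no longer dominate the fit. Whether that trade-off suits the data is something to check against the held-out set.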

Besides, the MSE is incredibly high, and the accuracy of the model seems very low. I’m also seeking advice on improving the accuracy. Is a classifier a better option in this scenario?
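
Whether an error figure counts as "high" depends on the scale and spread of the target, so one way to judge it is to compare against a trivial baseline. The sketch below is only an illustration under assumptions: the features and the skewed target are synthetic (generated with NumPy, not taken from the question's data), and the baseline is scikit-learn's DummyRegressor predicting the training median.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 5 numeric features and a positive, right-skewed target
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.exp(10 + X[:, 0] + rng.normal(scale=0.5, size=200))

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=1
)

# Baseline that ignores the features and always predicts the training median
baseline = DummyRegressor(strategy="median").fit(X_train, y_train)
ridge = Ridge().fit(X_train, y_train)

mae_baseline = mean_absolute_error(y_test, baseline.predict(X_test))
mae_ridge = mean_absolute_error(y_test, ridge.predict(X_test))

print("Baseline MAE:", mae_baseline)
print("Ridge MAE:   ", mae_ridge)
```

If the real model's error is not clearly below the baseline's, the features are contributing little, and that holds regardless of how large the raw MSE looks.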

Any input will be appreciated, ty in advance, stay safe everyone.

Source: Python Questions
