I want to predict income using text data (a description) as the predictor. This is what my dataframe looks like:
c_description
641 fierce roman commander marcus vinicius become ...
645 melancholy poet reflect three woman love lose ...
644 disturb blanche dubois move sister new orleans...
643 lonely woman recall first love thirteen year p...
642 three adolescent girl grow bengal india learn ...
d_worldwide_gross_income
641 1034933.275020
645 1089736.217494
644 505025.329393
643 73424.113475
642 544123.669819
Here is the modelling code:
import numpy as np
import pandas as pd  # needed for pd.DataFrame below
from sklearn import metrics
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

def model():
    # model_df is the dataframe shown above; it is defined elsewhere.
    # Turn the descriptions into a sparse TF-IDF matrix.
    vectorizer = TfidfVectorizer()
    x = vectorizer.fit_transform(model_df['c_description'])
    # Return value is unused; newer scikit-learn versions name this get_feature_names_out().
    vectorizer.get_feature_names()
    y = model_df['d_worldwide_gross_income']
    X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
    # Fit a ridge regression and predict on the held-out 30%.
    clf = Ridge()
    clf.fit(X_train, y_train)
    pred = clf.predict(X_test)
    print(pred)
    pred_df = pd.DataFrame({'Actual': y_test, 'Predicted': pred})
    display(pred_df)  # display() is available in Jupyter/IPython
    print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, pred))
    print('Mean Squared Error:', metrics.mean_squared_error(y_test, pred))
    print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, pred)))
The problem is that I get negative predictions (starred in the output), which does not make sense, since gross income can't be negative. I assume this is common when you predict a continuous variable from vectorized text data (a sparse matrix) that contains a lot of zeros. Is there a way to handle this issue?
Output:
Actual Predicted
14678 6833413.127504 2849365.333598
12631 15076388.644552 7301462.466993
16131 1512745.545534 3046698.088006
4406 25325.846617 **-1436044.714117**
21199 124397.540278 5321914.505052
Mean Absolute Error: 4102039.343052313
Mean Squared Error: 35381871200690.305
Root Mean Squared Error: 5948266.234852834
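A linear model such as Ridge has an unbounded output, so nothing in it prevents a prediction below zero. One common way to rule that out is to fit the regressor on the logarithm of the target and map predictions back with the exponential. The sketch below illustrates this with scikit-learn's TransformedTargetRegressor. It is only a sketch: the toy `texts` and `income` values are made up stand-ins for model_df, and it assumes every income is strictly positive.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import Ridge

# Hypothetical toy data standing in for model_df (values are made up).
texts = [
    "fierce roman commander",
    "melancholy poet reflect love",
    "lonely woman recall first love",
    "three adolescent girl grow",
    "roman poet love",
    "lonely commander grow",
]
income = np.array([1.0e6, 1.1e6, 7.3e4, 5.4e5, 2.5e4, 6.8e6])

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(texts)

# Ridge is fitted on log(income); predictions are mapped back with exp,
# so they are always strictly positive.
# Assumption: all target values are > 0 (log is undefined otherwise).
reg = TransformedTargetRegressor(
    regressor=Ridge(),
    func=np.log,
    inverse_func=np.exp,
)
reg.fit(X, income)

# Predict for new descriptions, including one with no known words.
new_X = vectorizer.transform(["roman girl recall love", "unknown words only"])
pred = reg.predict(new_X)
print(pred)
```

Errors computed on the back-transformed predictions are still in income units, so they can be compared with the MAE and RMSE of the untransformed model. Whether the log target actually improves them depends on the data.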
Besides, the MSE is incredibly high and the accuracy of the model seems very low. I'm also seeking advice on improving the accuracy. Would a classifier be a better option in this scenario?
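Because the target is in currency units, a large MSE partly reflects the scale of the incomes, so it may help to compare the model with a trivial baseline before judging it. The sketch below does this with scikit-learn's DummyRegressor. The helper name `baseline_report` and the numbers are hypothetical and only show the mechanics; they are not taken from the data above.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_absolute_error, r2_score

def baseline_report(y_train, y_test, pred):
    """Compare model predictions with a baseline that always predicts the training mean."""
    dummy = DummyRegressor(strategy="mean")
    # DummyRegressor ignores the features, so a placeholder column of zeros is enough.
    dummy.fit(np.zeros((len(y_train), 1)), y_train)
    base_pred = dummy.predict(np.zeros((len(y_test), 1)))
    return {
        "model_mae": mean_absolute_error(y_test, pred),
        "baseline_mae": mean_absolute_error(y_test, base_pred),
        "model_r2": r2_score(y_test, pred),
    }

# Made-up numbers purely for illustration.
y_train = np.array([1.0e6, 2.0e6, 3.0e6, 6.0e6])
y_test = np.array([2.0e6, 4.0e6])
pred = np.array([2.5e6, 3.5e6])

report = baseline_report(y_train, y_test, pred)
print(report)
```

If the model's MAE is not clearly below the baseline's, or its R² is close to zero, the text features are adding little regardless of how large the raw MSE looks.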
Any input will be appreciated, ty in advance, stay safe everyone.
Source: Python Questions