[First, I would like to thank the SO community, with whose support I was able to finish an end-to-end data science project, including deployment.]
Now, on to my question.
I wanted to create an app that predicts success or failure from textual input data.
To build the model, I used existing algorithms rather than developing one from scratch.
While training on my dataframe (a textual corpus), I frequently wrote:
pipeline = Pipeline([
    ("vect", CountVectorizer()),
    ("tfidf", TfidfTransformer()),
    ("clf", my_chosen_model()),
])
I did this because I saw several code examples doing so, but I never clearly understood what Pipeline actually does. Can anyone explain it (or share a link that can help people like me from non-programming backgrounds understand it better)?
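For context, here is a minimal runnable sketch of how I fit and use such a pipeline; the toy texts and labels are made up, and `MultinomialNB` stands in for `my_chosen_model`:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in corpus (hypothetical; my real corpus is much larger).
texts = ["project shipped on time", "deadline missed badly",
         "launch was a success", "the release failed"]
labels = [1, 0, 1, 0]  # 1 = success, 0 = failure

pipeline = Pipeline([
    ("vect", CountVectorizer()),    # raw text -> token counts
    ("tfidf", TfidfTransformer()),  # token counts -> TF-IDF weights
    ("clf", MultinomialNB()),       # classifier trained on the weighted features
])

# fit() runs each step in order: each transformer is fitted, then its
# transformed output is passed to the next step.
pipeline.fit(texts, labels)

# predict() pushes new raw text through the same fitted steps.
print(pipeline.predict(["the launch was a success"]))
```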
After completing my experiments, I obtained the following accuracy scores on the text data:
RandomForest    SVM     MLP     MultinomialNB
0.80            0.85    0.86    0.80
When I increased the amount of data (from the initial 20000 rows to 50000), the accuracy scores were:
RandomForest    SVM     MLP     MultinomialNB
0.85            0.85    0.87    0.82
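For reference, the way I compared the algorithms was essentially to swap the final estimator in the same pipeline and cross-validate it. A minimal sketch (the toy corpus and the 5-fold split here are my assumptions, not my real setup):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

# Toy stand-in corpus (hypothetical; my real data has 20k-50k rows).
texts = (["great success launch shipped"] * 20 +
         ["failure missed broken delayed"] * 20)
labels = [1] * 20 + [0] * 20

models = {"MultinomialNB": MultinomialNB(), "SVM": LinearSVC()}
for name, clf in models.items():
    # Same preprocessing steps each time; only the final estimator changes.
    pipe = Pipeline([("vect", CountVectorizer()),
                     ("tfidf", TfidfTransformer()),
                     ("clf", clf)])
    scores = cross_val_score(pipe, texts, labels, cv=5)  # accuracy by default
    print(f"{name}: mean accuracy {scores.mean():.2f}")
```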
I also tried BERT, which gave an accuracy of 0.72 on the text data.
Can someone please explain two things:
Why did BERT perform so poorly, even though BERT is preferred by many? I followed the exact steps described here for my task: https://www.analyticsvidhya.com/blog/2020/07/transfer-learning-for-nlp-fine-tuning-bert-for-text-classification/
What do the accuracy scores imply about the different algorithms with respect to the data? And would it be wise to choose MLP or SVM, or to pick Random Forest because of its jump in performance when the data grew?
Source: Python-3x Questions