I am new to NLP and I am working on an email classification project. I am programming in Python and using BERT for this task, but I have an issue with the email texts.
Quite a few of these emails contain disclaimer texts which are not relevant to the email. In many cases the text of the disclaimer is even longer than the relevant email text.
So I would like to get rid of these disclaimers, because that way:
- I will make the email texts shorter
- I will remove noise from the dataset
My only problem is that I do not know how to do that. Could you give me some hints or ideas on how to tackle this problem?
I am more interested in possible methodologies for detecting and removing these unwanted texts than in implementation details.
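To make the goal concrete, here is a minimal sketch of one naive baseline I can imagine: treat any paragraph that repeats (after whitespace and case normalization) in several emails as a disclaimer and drop it. The function names, the `min_count` threshold, and the assumption that paragraphs are separated by blank lines are all hypothetical and chosen for illustration; I have not validated this on my data.

```python
import re
from collections import Counter


def split_paragraphs(text):
    # Assumption: paragraphs are separated by one or more blank lines.
    parts = re.split(r"\n\s*\n", text)
    return [p.strip() for p in parts if p.strip()]


def normalize(paragraph):
    # Collapse whitespace and lowercase so trivially different copies match.
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def strip_repeated_paragraphs(emails, min_count=3):
    # Count in how many distinct emails each normalized paragraph occurs.
    doc_freq = Counter()
    for email in emails:
        doc_freq.update({normalize(p) for p in split_paragraphs(email)})

    # Keep only paragraphs that occur in fewer than min_count emails.
    cleaned = []
    for email in emails:
        kept = [
            p for p in split_paragraphs(email)
            if doc_freq[normalize(p)] < min_count
        ]
        cleaned.append("\n\n".join(kept))
    return cleaned


if __name__ == "__main__":
    sample = [
        "Hi A\n\nThis email is confidential.",
        "Hello B\n\nThis email is  confidential.",
        "Meeting at 5\n\nthis email is confidential.",
    ]
    for text in strip_repeated_paragraphs(sample, min_count=3):
        print(repr(text))
```

An obvious weakness is that short, legitimately repeated paragraphs (greetings, sign-offs) would be removed too, and disclaimers that vary slightly between senders would be missed, which is why I am asking whether there is a more robust approach.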
Thank you in advance.
Source: Python Questions