BERT Domain Adaptation

I am using transformers.BertForMaskedLM to further pre-train the BERT model on my custom dataset. I first serialize all the text to a .txt file by separating the words by a whitespace. Then, I am using transformers.TextDataset to load the serialized data with a BERT tokenizer given as tokenizer argument. Then, I am using BertForMaskedLM.from_pretrained() to load the pre-trained model (which is what transformers library presents). Then, I am using transformers.Trainer to further pre-train the model on my custom dataset, i.e., domain adaptation, for 3 epochs. I save the model with trainer.save_model(). Then, I want to load the further pre-trained model to get the embeddings of the words in my custom dataset. To load the model, I am using AutoModel.from_pretrained() but this pops up a warning.

Some weights of the model checkpoint at {path to my further pre-trained model} were not used when initializing BertModel

So, I know why this pops up. Because I further pre-trained using transformers.BertForMaskedLM but when I load with transformers.AutoModel, it loads it as transformers.BertModel. What I do not understand is if this is a problem or not. I just want to get the embeddings, e.g., embedding vector with a size of 768.

Source: Python Questions

LEAVE A COMMENT