Category: huggingface-datasets

I'm trying to tokenize two sentences separately and store the output as a datasets.Sequence, since the pretrained model takes input_ids, token_type_ids and attention_mask as input. I'd like to build a dataset like {'answers': Sequence(feature={'text': Value(dtype='string', id=None), 'answer_start': Value(dtype='int32', id=None)}, length=-1, id=None), 'context': Value(dtype='string', id=None), 'id': Value(dtype='string', id=None), 'question': Value(dtype='string', id=None), 'title': Value(dtype='string', id=None)}, which is shown ..
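A minimal sketch of how such a schema could be declared for two tokenized questions, assuming a BERT-style checkpoint and columns named question1/question2 (the checkpoint name, column names and label field are assumptions, not from the post):

```python
from datasets import Features, Sequence, Value
from transformers import AutoTokenizer

# Assumed checkpoint; the post only says "a pretrained model".
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Each tokenized question becomes a struct of integer sequences.
token_feats = {
    "input_ids": Sequence(Value("int32")),
    "token_type_ids": Sequence(Value("int32")),
    "attention_mask": Sequence(Value("int32")),
}
features = Features({"q1": token_feats, "q2": token_feats, "label": Value("int32")})
# `features` can then be passed to Dataset.map(..., features=features) so the
# tokenizer outputs are stored with this schema.
```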

Read more

I'm currently building a Siamese network with a pretrained BERT model from transformers, which takes 'input_ids', 'token_type_ids' and 'attention_mask' as inputs. My dataset is structured as question1, question2, label, so I have to tokenize each question separately. def tokenize(ds): q1 = datasets.Sequence(tokenizer(ds['question1'], padding='max_length', truncation=True, max_length=128)) q2 = datasets.Sequence(tokenizer(ds['question2'], padding='max_length', truncation=True, max_length=128)) return {"q1": q1, "q2": q2} dataset_tokenized = dataset.map(tokenize) The process has ..
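One sketch of how the mapping function could look without wrapping the tokenizer output in datasets.Sequence (which is a feature type, not a data container); the flat key names are my own choice, not from the post:

```python
def tokenize(example):
    # Tokenize the two questions separately with the settings from the post.
    q1 = tokenizer(example["question1"], padding="max_length", truncation=True, max_length=128)
    q2 = tokenizer(example["question2"], padding="max_length", truncation=True, max_length=128)
    # Return plain lists of ints; datasets infers Sequence features automatically.
    return {
        "q1_input_ids": q1["input_ids"],
        "q1_token_type_ids": q1["token_type_ids"],
        "q1_attention_mask": q1["attention_mask"],
        "q2_input_ids": q2["input_ids"],
        "q2_token_type_ids": q2["token_type_ids"],
        "q2_attention_mask": q2["attention_mask"],
    }

dataset_tokenized = dataset.map(tokenize)
```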

Read more

I'm following along with this Notebook, at the cell "Loading the dataset", and I want to use the datasets library. I've restarted and rerun the conda_pytorch_p36 kernel without luck. I run: ! pip install datasets transformers optimum[intel] Output: Requirement already satisfied: datasets in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (1.17.0) Requirement already satisfied: transformers in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (4.15.0) Requirement already satisfied: optimum[intel] in /home/ec2-user/anaconda3/envs/pytorch_p36/lib/python3.6/site-packages (0.1.3) ..
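A common check when a Jupyter kernel does not pick up installed packages is to install them with the exact interpreter the kernel runs; this is a hedged sketch, not something from the notebook in question:

```python
# Run inside the notebook so pip targets the kernel's own Python environment.
import sys
!{sys.executable} -m pip install --upgrade datasets transformers optimum[intel]
# Restart the kernel afterwards so the new versions are importable.
```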

Read more

My dataset has 5,000,000 rows, and I would like to add a column called 'embeddings' to it: dataset = dataset.add_column('embeddings', embeddings) The variable embeddings is a numpy memmap array of shape (5000000, 512). But I get this error: ArrowInvalid Traceback (most recent call last) ----> 1 dataset = dataset.add_column('embeddings', embeddings) /opt/conda/lib/python3.8/site-packages/datasets/arrow_dataset.py in wrapper(*args, ..
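One commonly suggested alternative, sketched below under the assumption that `embeddings` is aligned row-for-row with the dataset, is to attach the vectors with a batched map so Arrow receives plain Python lists rather than a 2-D memmap:

```python
def attach_embeddings(batch, indices):
    # Slice the memmap for just this batch and hand Arrow plain Python lists.
    batch["embeddings"] = embeddings[indices].tolist()
    return batch

dataset = dataset.map(
    attach_embeddings,
    with_indices=True,
    batched=True,
    batch_size=1000,
)
```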

Read more

I want to pre-train a T5 model using huggingface. The first step is training the tokenizer with this code: import datasets from t5_tokenizer_model import SentencePieceUnigramTokenizer vocab_size = 32_000 input_sentence_size = None # Initialize a dataset dataset = datasets.load_dataset("oscar", name="unshuffled_deduplicated_fa", split="train") tokenizer = SentencePieceUnigramTokenizer(unk_token="<unk>", eos_token="</s>", pad_token="<pad>") # Build an iterator over this dataset def batch_iterator(input_sentence_size=None): if ..
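For reference, a sketch of the batch-iterator pattern used in the Hugging Face T5 pre-training example; details such as the batch size are assumptions on my part, since the post is truncated at this point:

```python
def batch_iterator(input_sentence_size=None):
    if input_sentence_size is None:
        input_sentence_size = len(dataset)
    batch_size = 256  # assumed; pick what fits in memory
    for i in range(0, input_sentence_size, batch_size):
        yield dataset[i : i + batch_size]["text"]

# Train the Unigram tokenizer on the streamed batches.
tokenizer.train_from_iterator(
    iterator=batch_iterator(input_sentence_size=input_sentence_size),
    vocab_size=vocab_size,
    show_progress=True,
)
```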

Read more

I am using SageMaker to train a model on multiple GBs of data. My data is loaded using huggingface's datasets.load_dataset method. Since the data is huge and I want to re-use it, I want to store it in an S3 bucket. I tried the following: from datasets import load_dataset dataset = load_dataset('s3://bucket_name/some_dir/data', 'oscar', 'unshuffled_deduplicated_en') but this results in: ..
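load_dataset expects a dataset name or a local script/path rather than an S3 URI, so the pattern usually suggested for S3 is to download once and round-trip through save_to_disk / load_from_disk with an S3 filesystem. A sketch, assuming datasets 1.x and with the bucket path as a placeholder:

```python
from datasets import load_dataset, load_from_disk
from datasets.filesystems import S3FileSystem

s3 = S3FileSystem()  # picks up the default AWS credentials

# Download from the Hub once, then persist the Arrow files to S3.
dataset = load_dataset("oscar", "unshuffled_deduplicated_en", split="train")
dataset.save_to_disk("s3://bucket_name/some_dir/data", fs=s3)

# Later (e.g. inside the SageMaker training job) reload directly from S3.
dataset = load_from_disk("s3://bucket_name/some_dir/data", fs=s3)
```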

Read more

I am trying to build a pre-trained model using dialog-gpt2 (Grossmend/rudialogpt3_medium_based_on_gpt2) and a custom dataset. Whenever I try to run the main function I get this error: TypeError: TextEncodeInput must be Union[TextInputSequence, Tuple[InputSequence, InputSequence]] /usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose) 406 batch_text_or_text_pairs, ..
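This TypeError usually appears when the fast tokenizer receives something other than strings (for example None values or nested objects). A hedged sketch of a pre-tokenization sanity filter, where the column name "text" is an assumption about the custom dataset:

```python
# Drop rows whose text field is missing or empty before tokenizing.
dataset = dataset.filter(lambda ex: isinstance(ex["text"], str) and ex["text"].strip() != "")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

dataset = dataset.map(tokenize, batched=True)
```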

Read more

I'm new to huggingface and am working on a movie-generation script. So far my code looks like this: from transformers import GPT2Tokenizer, GPTNeoModel from datasets import load_dataset dataset = load_dataset('text', data_files={'train': ['youtube_3/script.txt']}) tokenizer = GPT2Tokenizer.from_pretrained('EleutherAI/gpt-neo-1.3B') model = GPTNeoModel.from_pretrained('EleutherAI/gpt-neo-1.3B') However, I keep getting this error: ValueError: Please pass `features` or at least one example when writing data ..
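That ValueError is usually raised when the dataset writer ends up with zero examples. A quick sanity-check sketch (the paths and column name follow the post; the blank-line filter is my own suggestion):

```python
dataset = load_dataset("text", data_files={"train": ["youtube_3/script.txt"]})
print(dataset)              # confirm the train split has num_rows > 0
print(dataset["train"][0])  # peek at the first line of the script

# Blank lines become empty examples; dropping them is often enough.
dataset = dataset.filter(lambda ex: ex["text"].strip() != "")
```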

Read more

I have been trying to train a DistilBERT model with huggingface, following the tutorial in this link and building the dataset as: feat = Features({"X": Sequence(Value("string")), "labels": Sequence(ClassLabel(num_classes=len(label_list), names=label_list))}) dt_train = Dataset.from_dict(train, features=feat) dt_test = Dataset.from_dict(test, features=feat) dt_dev = Dataset.from_dict(dev, features=feat) Right now I have a ..
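A self-contained version of that schema with toy data in place of the poster's train/test/dev dicts (the label names and example rows are made up for illustration):

```python
from datasets import ClassLabel, Dataset, Features, Sequence, Value

label_list = ["O", "B-ENT", "I-ENT"]  # assumed label set
feat = Features({
    "X": Sequence(Value("string")),
    "labels": Sequence(ClassLabel(num_classes=len(label_list), names=label_list)),
})

# One example sentence, already split into tokens, with integer class ids.
train = {"X": [["Hello", "world"]], "labels": [[0, 0]]}
dt_train = Dataset.from_dict(train, features=feat)
print(dt_train.features)
```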

Read more