Category: tokenize

import stanza

text = "XXXXXX,XXXX “XX” , XXXXX 50 XX 、 XXXXXXXXX、XXXXXXXXXXXXXXXXXXX"
zh_nlp = stanza.Pipeline('zh')
doc = zh_nlp(text)
for sent in doc.sentences:
    print("Tokenize: " + ' '.join(token.text for token in sent.tokens))

Stanza can segment text entered in Chinese, which depends heavily on tokenization and sentence segmentation since words are not delimited by spaces. It outputs the following. It's roughly divided ..
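As a minimal sketch of the same idea (assuming the Chinese model has been downloaded, and using a hypothetical sample sentence in place of the redacted text above), the pipeline can be limited to the tokenize processor, which is all this example needs:

import stanza

stanza.download('zh')  # one-time model download
# A tokenize-only pipeline avoids loading the POS/lemma/parse models.
zh_nlp = stanza.Pipeline('zh', processors='tokenize')

doc = zh_nlp("我喜欢自然语言处理。它很有趣。")  # hypothetical sample text
for i, sent in enumerate(doc.sentences):
    # Sentence segmentation and tokenization happen in one pass;
    # each sentence object exposes its own token list.
    print(f"Sentence {i}: " + ' '.join(token.text for token in sent.tokens))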

Read more

I have a French text containing two apostrophes, and I would like all of the apostrophes to be split the same way. For instance:

>>> from nltk import word_tokenize
>>> doc = "l’examen est normale. Il n’y a aucun changement"
>>> tokens = word_tokenize(doc, language='french')
>>> tokens
["l’examen", 'est', 'normale', '.', 'Il', 'n', "’", 'y', 'a', 'aucun', ..
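One way to get a consistent split is a sketch using NLTK's RegexpTokenizer instead of word_tokenize, so the rule is explicit rather than inherited from the English-oriented default tokenizer. The pattern below is an assumption about the desired output (it keeps the apostrophe attached to the elided word) and treats straight and typographic apostrophes identically:

from nltk.tokenize import RegexpTokenizer

# Match "word + apostrophe" (straight or typographic) as one token,
# then bare words, then any other single non-space character.
tokenizer = RegexpTokenizer(r"\w+['’]|\w+|[^\w\s]")

doc = "l’examen est normale. Il n’y a aucun changement"
print(tokenizer.tokenize(doc))
# ['l’', 'examen', 'est', 'normale', '.', 'Il', 'n’', 'y', 'a', 'aucun', 'changement']

With this rule, l’examen and n’y are handled the same way: both yield an elided token ("l’", "n’") followed by the word it attaches to.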

Read more