NLTK TweetTokenizer incorrectly separates contractions

  nltk, python, python-3.x, tokenize, twitter

I have a personal Python project where I am trying to tokenize tweets. I am using NLTK’s TweetTokenizer to break up these tweets, but I am running into an issue where contractions are incorrectly split apart:

Example: "can’t" -> ["can", "’", "t"]

I am struggling to find any documentation on this error. I have pasted relevant code below.

An important note: TweetTokenizer works correctly with strings that I hardcode into my program, but not with strings that originate from Twitter.

from nltk import pos_tag
from nltk.tokenize import TweetTokenizer

def tweetsTagger(tweets):
    """Tokenize and POS-tag each tweet."""
    tt = TweetTokenizer()  # one tokenizer can be reused for every tweet
    tweetsTagged = []
    for tweet in tweets:  # tweet is a Status object from the Tweepy API
        # Use full_text when the Status object has it, otherwise fall back to text
        if hasattr(tweet, 'full_text'):
            text = str(tweet.full_text)
        else:
            text = str(tweet.text)
        tweetTokenized = tt.tokenize(text)
        tweetTagged = pos_tag(tweetTokenized)
        tweetsTagged.append(tweetTagged)
    return tweetsTagged

I think the error may have to do with TweetTokenizer not recognizing certain Unicode apostrophes, but I may be wrong about that.
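
One way to test the Unicode-apostrophe hypothesis is to list the non-ASCII characters in a tweet and, if a typographic apostrophe such as U+2019 turns up, replace it with the ASCII apostrophe before tokenizing. The following is a minimal sketch using only the standard library. The helper names describe_non_ascii and normalize_apostrophes are hypothetical, and the set of characters in the mapping is an assumption about what tweets may contain, not an exhaustive list.

```python
import unicodedata

# Assumed set of apostrophe look-alikes; extend it if other characters appear.
APOSTROPHE_MAP = str.maketrans({
    "\u2019": "'",  # RIGHT SINGLE QUOTATION MARK
    "\u2018": "'",  # LEFT SINGLE QUOTATION MARK
    "\u02bc": "'",  # MODIFIER LETTER APOSTROPHE
})


def describe_non_ascii(text):
    """Return (character, Unicode name) pairs for every non-ASCII character."""
    return [(ch, unicodedata.name(ch, "UNKNOWN")) for ch in text if ord(ch) > 127]


def normalize_apostrophes(text):
    """Replace the mapped typographic apostrophes with the ASCII apostrophe."""
    return text.translate(APOSTROPHE_MAP)


sample = "can\u2019t"  # the contraction written with a curly apostrophe
print(describe_non_ascii(sample))
print(normalize_apostrophes(sample))
```

If the first call reports a typographic apostrophe for text coming from Twitter, passing normalize_apostrophes(text) to tt.tokenize instead of the raw text may keep the contraction in one token. Whether it does depends on the installed NLTK version.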

Source: Python Questions
