Well, the title was supposed to mimic “What’s in a name?”.
So what’s in a tweet? As you would expect, quite a lot. This post discusses (in layman’s terms) some of the features that we extract from a tweet and the algorithms used in the process.
Let’s make a list of features.
1. Named Entities.
2. Words in the url.
3. Words in the hashtags.
4. Script.
5. Language.
6. Profanity.
7. Word stems.
8. Sentiment.
9. Intent, especially commercial intent.
10. Features for text classification and other ML problems.
1. Named Entity Recognition:
NER involves identifying proper names in a text, such as people, organizations and locations, along with dates, times, monetary expressions etc. Let’s see an example that will be used throughout this post.
ex: Some residents of Koramangala met PM Singh in Delhi. They expressed satisfaction, but vowed to continue the protest. #LokpalBill http://t.co/xHRbW
where the url may expand to http://ndtv.com/politics/…/koramangala-pm-singh-delhi-lokpal-bill.htm
Here, PM Singh, Delhi, Lokpal Bill and Koramangala (or “residents of Koramangala”) are the Named Entities of interest. Though “Some” and “They” are capitalized, they are only capitalized because they start a sentence, so they are not that interesting.
There are two broad categories of algorithms used for NER.
1. Rule based models also called Linguistic grammar based models.
2. Statistical models like Hidden Markov Model, Maximum Entropy, Conditional Random Field etc.
With a rule based approach, cues like capitalization and stopword lists help us identify the entities: a run of capitalized words is a candidate entity, unless it is a stopword that merely begins a sentence.
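As a rough illustration of the rule based idea, here is a minimal sketch that uses only capitalization and a stopword list. The function name candidate_entities and the tiny STOPWORDS set are made up for this example; a real system would need a far larger list and many more rules.

```python
import re

# Illustrative stopword list (an assumption); real lists are much larger.
STOPWORDS = {"some", "they", "the", "a", "an", "in", "of", "but", "to"}

def candidate_entities(text):
    """Return runs of consecutive capitalized words that are not stopwords."""
    # Drop urls and hashtags; they are analysed separately.
    text = re.sub(r"https?://\S+|#\w+", " ", text)
    entities = []
    for sentence in re.split(r"[.!?]+", text):
        run = []
        for word in re.findall(r"[A-Za-z]+", sentence):
            if word[0].isupper() and word.lower() not in STOPWORDS:
                run.append(word)
            else:
                if run:
                    entities.append(" ".join(run))
                run = []
        if run:
            entities.append(" ".join(run))
    return entities

tweet = ("Some residents of Koramangala met PM Singh in Delhi. They expressed "
         "satisfaction, but vowed to continue the protest. "
         "#LokpalBill http://t.co/xHRbW")
print(candidate_entities(tweet))  # ['Koramangala', 'PM Singh', 'Delhi']
```

Note that “Some” and “They” are rejected only because they are in the stopword list; a capitalized non-stopword at the start of a sentence would still slip through, which is one reason statistical models are often preferred.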
2. URL analysis:
While keeping track of urls might help with clustering, ranking etc., the keywords present in the urls themselves are very interesting. Sometimes the tweet itself might not have entities (“Good Read. http:…”) but the expanded url may have words that are useful to us. If we can recognize a hierarchy in the url (“/politics/”), then such words can be used for classification, among other things.
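A minimal sketch of pulling the hierarchy and keywords out of an already expanded url might look like this. The function name url_keywords and the example.com address are hypothetical, and the rule “last path segment is the slug, everything before it is the hierarchy” is an assumption that will not hold for every site.

```python
import re
from urllib.parse import urlparse

def url_keywords(url):
    """Split a url path into (hierarchy sections, slug keywords)."""
    path = urlparse(url).path
    segments = [s for s in path.split("/") if s]
    if not segments:
        return [], []
    # Assumption: leading segments form the hierarchy, the last is the slug.
    *sections, slug = segments
    slug = re.sub(r"\.[A-Za-z0-9]+$", "", slug)  # drop an extension like .htm
    keywords = [w.lower() for w in re.split(r"[-_+]", slug) if w]
    return sections, keywords

# Hypothetical url, shaped like the example in this post.
sections, keywords = url_keywords(
    "http://example.com/politics/koramangala-pm-singh-delhi-lokpal-bill.htm")
print(sections)  # ['politics']
print(keywords)  # ['koramangala', 'pm', 'singh', 'delhi', 'lokpal', 'bill']
```

Shortened links such as t.co carry no keywords themselves, so they have to be expanded before this step.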
3. Hashtag analysis:
Hashtags, for the most part, convey interesting information about the tweet. They may identify or reinforce the subject being discussed, point to the class or a related keyword, give hints about the sentiment etc. The words inside a hashtag are easiest to recover when it is capitalized or in camel case; when several lowercase words are simply written together, splitting them needs a dictionary.
ex: “Lokpal Bill” from #LokpalBill
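One simple way to do the camel case split is a regular expression over case changes. This is only a sketch: split_hashtag is a made-up name, the pattern handles ASCII letters and digits only, and an all-lowercase tag comes back unsplit.

```python
import re

def split_hashtag(tag):
    """Split a hashtag into words on case changes, digits and underscores."""
    body = tag.lstrip("#")
    # Acronym runs ("PM"), capitalized or lowercase words, then digit runs.
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", body)
    return " ".join(parts)

print(split_hashtag("#LokpalBill"))  # Lokpal Bill
print(split_hashtag("#PMSingh"))     # PM Singh
print(split_hashtag("#lokpalbill"))  # lokpalbill
```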
4. Script Detection:
While the charset of most social media updates tends to be utf8, it’s not a bad idea to verify it ourselves. Once that is done, it’s a question of going over the code points and comparing them with the known ranges for various scripts. However, it’s advisable to optimize this process using appropriate data structures (a sorted table of ranges, for instance) and counting schemes. We should also keep in mind the presence of characters from other languages (especially English) in handles, urls, hashtags etc.
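The code point comparison can be sketched with a sorted range table and a binary search. The table below is a small illustrative subset of Unicode blocks, not a complete list, and the names script_of and dominant_script are invented for this example. Handles, hashtags and urls are skipped so that their English characters do not skew the count.

```python
from bisect import bisect_right
from collections import Counter

# (start, end, script), sorted by start. An illustrative subset only.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x0600, 0x06FF, "Arabic"),
    (0x0900, 0x097F, "Devanagari"),
    (0x0B80, 0x0BFF, "Tamil"),
    (0x0C80, 0x0CFF, "Kannada"),
]
RANGE_STARTS = [start for start, _, _ in SCRIPT_RANGES]

def script_of(ch):
    """Binary search for the range containing this character, if any."""
    cp = ord(ch)
    i = bisect_right(RANGE_STARTS, cp) - 1
    if i >= 0 and cp <= SCRIPT_RANGES[i][1]:
        return SCRIPT_RANGES[i][2]
    return None

def dominant_script(text):
    """Most frequent script in the text, ignoring handles, hashtags, urls."""
    counts = Counter()
    for token in text.split():
        if token.startswith(("@", "#", "http://", "https://")):
            continue
        for ch in token:
            script = script_of(ch)
            if script:
                counts[script] += 1
    return counts.most_common(1)[0][0] if counts else None

print(dominant_script("Some residents of Koramangala met PM Singh in Delhi."))
```

Digits, punctuation and spaces fall outside every range here and are simply not counted.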
5. Language Classification:
While twitter itself identifies a number of languages, it’s not a bad idea to implement a language classifier ourselves. A Naive Bayes Classifier (NBC) seems more than sufficient to identify languages. All you need to do is find character ngrams from the words in a tweet and use the NBC to classify. A number of optimization steps are possible while selecting ngrams and while implementing the classifier. Identifying the language of the tweet is especially important if the tweet has vernacular language words transliterated in English. Such words might affect rule based NER, for example.
ex: while the example tweet is unlikely to confuse the classifier, watch out for “Singh” and “Lokpal” in other circumstances.
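To make the ngram plus Naive Bayes idea concrete, here is a minimal sketch with character trigrams and add-one smoothing. The class name NgramNaiveBayes, the language labels and the two training sentences are toy assumptions; a usable classifier needs a real corpus per language.

```python
import math
from collections import Counter, defaultdict

def char_ngrams(text, n=3):
    """Character n-grams of each word, padded with a space on both sides."""
    grams = []
    for word in text.lower().split():
        padded = f" {word} "
        grams.extend(padded[i:i + n] for i in range(len(padded) - n + 1))
    return grams

class NgramNaiveBayes:
    def __init__(self, n=3):
        self.n = n
        self.gram_counts = defaultdict(Counter)  # language -> n-gram counts
        self.doc_counts = Counter()              # language -> training texts
        self.vocab = set()

    def train(self, text, lang):
        grams = char_ngrams(text, self.n)
        self.gram_counts[lang].update(grams)
        self.doc_counts[lang] += 1
        self.vocab.update(grams)

    def classify(self, text):
        grams = char_ngrams(text, self.n)
        total_docs = sum(self.doc_counts.values())
        best_lang, best_score = None, -math.inf
        for lang, counts in self.gram_counts.items():
            total = sum(counts.values())
            # Log prior plus log likelihoods with add-one smoothing.
            score = math.log(self.doc_counts[lang] / total_docs)
            for g in grams:
                score += math.log((counts[g] + 1) / (total + len(self.vocab)))
            if score > best_score:
                best_lang, best_score = lang, score
        return best_lang

# Toy training data, one sentence per label (assumed for illustration).
clf = NgramNaiveBayes()
clf.train("they vowed to continue the protest", "en")
clf.train("hum log virodh jaari rakhenge", "hi-translit")
print(clf.classify("the protest"))  # en
print(clf.classify("hum log"))      # hi-translit
```

Working in log space avoids underflow when many small probabilities are multiplied, and the smoothing keeps an unseen ngram from zeroing out a language.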
6. Profanity Detection:
While a keyword list or taxonomy based profanity lookup seems reasonable enough, we need to ensure that deliberate or accidental misspellings of profane words don’t escape our detector. Both a brute-force lookup of known misspellings and an edit distance based detection are promising.
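The edit distance option can be sketched as follows, using the standard Levenshtein distance. The BLOCKLIST holds mild placeholder words and the threshold of one edit is an assumption for this example, not a recommended setting.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Placeholder entries; a real list would come from a curated taxonomy.
BLOCKLIST = {"darn", "heck"}

def is_profane(word, max_distance=1):
    """True if the word is within max_distance edits of a blocked word."""
    w = word.lower()
    return any(edit_distance(w, bad) <= max_distance for bad in BLOCKLIST)

print(is_profane("d4rn"))   # True
print(is_profane("Delhi"))  # False
```

The catch is false positives on short words: “barn” is also one edit away from “darn”, so a whitelist of ordinary words or a length-dependent threshold is usually needed on top.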
7. Stemming:
For a number of reasons, finding the stem of the words in a tweet is useful. At the very least, this can help in maintaining shorter dictionaries. But stemming may become necessary if we have to find related keywords, index the tweet for searching etc. An implementation of the Porter stemmer is made available by Dr. Porter himself. In practice, a combination of stemming programs is typically used, and maintaining a good stemming dictionary is also important.
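To show how a stemming dictionary and a stemming program can work together, here is a deliberately crude sketch. It is not the Porter algorithm: it only strips a handful of suffixes, and the names crude_stem, STEM_DICTIONARY and SUFFIXES, as well as the entries in them, are invented for illustration.

```python
# Irregular forms that suffix stripping cannot handle (hypothetical entries).
STEM_DICTIONARY = {"met": "meet"}
# Checked in order, so longer suffixes come first.
SUFFIXES = ["ing", "ed", "es", "s"]

def crude_stem(word, min_stem=3):
    """Dictionary lookup first, then naive suffix stripping."""
    w = word.lower()
    if w in STEM_DICTIONARY:
        return STEM_DICTIONARY[w]
    for suffix in SUFFIXES:
        # Keep at least min_stem characters so short words stay intact.
        if w.endswith(suffix) and len(w) - len(suffix) >= min_stem:
            return w[:-len(suffix)]
    return w

for word in ["residents", "met", "expressed", "vowed", "protest"]:
    print(word, "->", crude_stem(word))
```

A real stemmer applies many more rules and conditions (for example, this sketch would turn “continue” into itself but “continued” into “continu”), which is why an established implementation is the better starting point.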
8. Sentiment Analysis:
A number of supervised learning methods can be used to classify tweets as positive/negative or into more detailed categories. A keyword based lookup can also be implemented, but the results are likely to be coarse. Besides the traditional features like keywords, exclamation marks etc., tweets typically have emoticons and “loooong” words to express sentiment.
ex: “They expressed satisfaction”
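The tweet-specific cues mentioned above are easy to turn into features for whichever classifier is used. The sketch below is only a feature extractor, not a sentiment classifier; the EMOTICONS table, its scores and the function name sentiment_features are assumptions made for this example.

```python
import re

# Tiny illustrative emoticon table with assumed polarity scores.
EMOTICONS = {":)": 1, ":-)": 1, ":D": 1, ":(": -1, ":-(": -1}
# Any word character repeated three or more times in a row ("loooong").
ELONGATED = re.compile(r"(\w)\1{2,}")

def sentiment_features(text):
    """Count exclamation marks, emoticon polarity and elongated words."""
    tokens = text.split()
    return {
        "exclamations": text.count("!"),
        "emoticon_score": sum(EMOTICONS.get(t, 0) for t in tokens),
        "elongated_words": sum(1 for t in tokens if ELONGATED.search(t)),
    }

print(sentiment_features("This is sooo goood :) !!"))
# {'exclamations': 2, 'emoticon_score': 1, 'elongated_words': 2}
```

Urls should be removed before this step, since something like “www” would otherwise count as an elongated word.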
9. Intent Detection:
Tweets where users pose a question or express an intent, especially commercial intent, are useful for recommendation systems. Similar to sentiment analysis, a number of supervised learning methods can be used to determine whether or not a tweet expresses intent.
ex: “but vowed to continue the protest.”
10. Features for text classification, clustering:
Training text classifiers using real time tweets is a very interesting way to keep the corpus up to date. A naive bayes classifier might be sufficient if the number of classes is not very large. However, more interesting hierarchies can be built by finding related words and using the words in hashtags/urls etc. Such hierarchies are useful in building taxonomies.
ex: PM Singh, Lokpal Bill, politics.
Note: A large number of insights can also be obtained by keeping track of retweet counts, replies, follower/list counts, connected-ness, conversations etc. They are more like the ‘dynamic score’ for a tweet, while the above features within the tweet help find its ‘static score’.