Comparing Feature Sets and Classifiers for Sentiment Analysis of Opinionated Free Text
Abstract
This master's thesis addresses sentiment analysis of documents on societal themes, classifying each document as positive or negative. The approach is applicable to review blogs, public polls, and similar sources. In this study, we compared different feature sets and different classifiers on datasets of opinionated texts with societal themes. In terms of number of documents, these datasets consist of one large set and six small sets. Taking the often-used “Bag of Words” feature set as the baseline, we tested four other models and concluded that selecting features by their part-of-speech tags always improves the results of sentiment classification, while adjective and negation tags describe opinionated documents more informatively in much smaller matrices, which saves considerable memory and processing time. Moreover, by selecting these tags according to their pointwise mutual information (PMI) ranks in positively and negatively labeled documents, we obtained the most informative sentiment words. Furthermore, in contrast with the predominant view in the sentiment analysis field that the support vector machine (SVC) is the best classifier for binary classification of opinionated documents, our results show that the linear discriminant classifier (LDC) can perform as well as the support vector machine while being 10 times faster. This difference in running time is convincing enough to substitute LDC for SVC in sentiment analysis when the number of features is large, as is the case in a Bag of Words model, because the time SVC needs for predicting labels is quadratic in the number of documents. Because the feature vectors obtained through PMI analysis are relatively small, the k-nearest neighbor classifier (KNNC) could be trained well and gave the most accurate results in comparison with LDC and SVC on both large and small datasets.
It should be stressed that Principal Component Analysis (PCA) was used in this study to extract, mathematically, the most common features in all the models.
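The PMI-based ranking of sentiment words mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulation, so everything here is an assumption: PMI is estimated from document-level counts as log2(P(word | class) / P(word)), a word's score is the difference between its PMI with the positive class and its PMI with the negative class, and the helper names `pmi_scores` and `top_words`, the `smoothing` parameter, and the toy documents are hypothetical. Part-of-speech filtering (keeping only adjective and negation tags) is assumed to have happened before this step and is not shown.

```python
import math
from collections import Counter


def pmi_scores(docs, labels, smoothing=1.0):
    """Score each word by PMI(word, 'pos') - PMI(word, 'neg').

    docs   : list of token lists (assumed already filtered by POS tag)
    labels : list of 'pos' / 'neg' strings, one per document
    Counts are document-level: a word counts once per document.
    """
    n_docs = len(docs)
    class_count = Counter(labels)   # documents per class
    word_count = Counter()          # documents containing the word
    joint_count = Counter()         # documents of a class containing the word

    for tokens, label in zip(docs, labels):
        for word in set(tokens):
            word_count[word] += 1
            joint_count[(word, label)] += 1

    def pmi(word, label):
        # Smoothed estimate of P(word | class); avoids log(0) for unseen pairs.
        p_word_given_class = (joint_count[(word, label)] + smoothing) / (
            class_count[label] + 2 * smoothing
        )
        p_word = word_count[word] / n_docs
        return math.log2(p_word_given_class / p_word)

    return {w: pmi(w, "pos") - pmi(w, "neg") for w in word_count}


def top_words(scores, k):
    """Return the k words with the largest absolute PMI difference."""
    return sorted(scores, key=lambda w: abs(scores[w]), reverse=True)[:k]


# Toy data, invented purely for illustration.
docs = [["good", "film"], ["good", "plot"], ["bad", "film"], ["bad", "plot"]]
labels = ["pos", "pos", "neg", "neg"]
scores = pmi_scores(docs, labels)
selected = top_words(scores, 2)
```

In this toy example, words that occur equally often in both classes receive a score of zero and fall to the bottom of the ranking, which is the sense in which a PMI rank can yield a small, informative vocabulary.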