Automated classification of user reviews

Detection of topic and sentiment

Abstract

Online customer reviews of products have become a large part of marketing intelligence in recent years. These documents are a source of information on which aspects of a discussed product can be improved. These aspects are called drivers. CQM, the company where the internship behind this thesis took place, has developed a tool in which reviews are manually annotated for a fixed set of drivers. As a result, each review can be assigned a driver score for each driver. The driver score can be a positive or negative number, indicating the sentiment with which the review discusses the driver, or it can be left blank, meaning that the review does not discuss the driver. The goal of this work is to (partially) automate this process, so that the driver scores can be predicted from the review text. The first step towards achieving this is to create a binary classification problem for every driver, where a binary variable can, for example, indicate whether or not a review discusses the driver. A step further is multinomial classification, which also distinguishes whether a driver is discussed with positive or negative sentiment. The reviews are represented as variables, with every variable representing the use of a word. This thesis experiments with different forms for these variables. Because every variable represents a unique word, the classification problem has a large number of distinct predictors. Given these predictors, two types of models are considered to solve the classification problem: elastic net models and random forests. Both types of models need to be adapted for class imbalance before they can be used to predict the driver scores. The results of these models are evaluated using the area under the curve (AUC) and, for the multinomial problem, a multinomial generalisation, the AUCμ.
These measures are chosen because they are effective at evaluating the performance of our models under class imbalance. The results were ultimately evaluated for various variable forms and models. Among the models, we found that a random forest adapted to grow its decision trees on stratified bootstrap samples gave strong performance, especially when combined with variables in indicator function form or normalized tf-idf form.
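
To make the ingredients named above concrete, the following is a minimal sketch in plain Python, not the implementation used in the thesis. The toy reviews, the labels, and all function names are hypothetical. The tf-idf weighting shown (term count times log inverse document frequency, scaled to unit length) is one common variant and may differ from the form used in the thesis. The stratified bootstrap here keeps the original class sizes, which is likewise an assumption.

```python
import math
import random
from collections import Counter

# Hypothetical toy data: 1 = review discusses the driver, 0 = it does not.
reviews = [
    "battery life is great",
    "the battery died fast",
    "nice screen and nice colours",
    "screen is too dim",
    "fast delivery",
    "great price",
]
labels = [1, 1, 0, 0, 0, 0]

docs = [r.split() for r in reviews]
vocab = sorted({w for d in docs for w in d})  # one variable per unique word
doc_freq = Counter(w for d in docs for w in set(d))


def indicator_form(doc):
    """Indicator form: 1.0 if the word occurs in the review, else 0.0."""
    present = set(doc)
    return [1.0 if w in present else 0.0 for w in vocab]


def tfidf_form(doc):
    """Normalized tf-idf form (assumed variant): tf * log(N / df), unit L2 norm."""
    tf = Counter(doc)
    raw = [tf[w] * math.log(len(docs) / doc_freq[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in raw))
    return [x / norm if norm > 0 else 0.0 for x in raw]


def stratified_bootstrap(labels, rng):
    """Resample indices with replacement within each class separately,
    so every bootstrap sample contains members of the minority class."""
    sample = []
    for cls in sorted(set(labels)):
        members = [i for i, y in enumerate(labels) if y == cls]
        sample.extend(rng.choices(members, k=len(members)))
    return sample


def auc(scores, labels):
    """Binary AUC: probability that a random positive is scored above a
    random negative, with ties counted as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

In a random forest, each tree would be grown on the rows selected by one call to `stratified_bootstrap`. Because the AUC depends only on how positives are ranked relative to negatives, it is not dominated by the majority class in the way plain accuracy is.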