Latent Dirichlet Allocation

Explained and improved upon for applications in marketing intelligence


Abstract

In today's digital world, customers give their opinions on products they have purchased online in the form of reviews. The industry is interested in these reviews and wants to know which topics its clients write about, so that producers can improve products in specific aspects. Topic models can extract the main topics from large data sets such as review data. One of these is Latent Dirichlet Allocation (LDA), a hierarchical Bayesian topic model that retrieves topics from text data sets in an unsupervised manner. The method assumes that a topic is assigned to each word in a document (review), and aims to retrieve a topic distribution for each document and a word distribution for each topic. Using the highest-probability words from each topic-word distribution, the content of each topic can be determined, so that the main subjects can be derived.

Three inference methods for obtaining the topic and word distributions are considered in this research: Gibbs sampling, variational methods, and Adam optimization to find the posterior mode. Gibbs sampling and Adam optimization have the best theoretical foundations for their application to LDA. From results on artificial and real data sets, it is concluded that Gibbs sampling has the best performance in terms of robustness and perplexity.

When the data set consists of reviews, it is desirable to extract the sentiment (positive, neutral, negative) from the documents in addition to the topics. Therefore, an extension to LDA that uses sentiment words and sentence structure as additional input is proposed: LDA with syntax and sentiment. In this model, a topic distribution and a sentiment distribution are retrieved for each review. Furthermore, a word distribution per topic-sentiment combination can be estimated. With these distributions, the main topics and sentiments in a data set can be determined. Adam optimization is used as the inference method. The algorithm is tested on simulated data and found to work well. However, the optimization method is very sensitive to hyperparameter settings, so Gibbs sampling is expected to perform better as the inference method for LDA with syntax and sentiment. Its implementation is left for further research.
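Since Gibbs sampling emerged as the most robust inference method for plain LDA, the idea can be illustrated with a minimal collapsed Gibbs sampler. This is only an illustrative sketch under simplifying assumptions, not the implementation evaluated in this research; the function name `lda_gibbs`, the toy hyperparameters `alpha` and `beta`, and the iteration count are all chosen for demonstration.

```python
# Minimal collapsed Gibbs sampler for LDA -- an illustrative sketch,
# not the implementation used in this research. Hyperparameters and
# corpus are toy assumptions chosen for demonstration.
import random

def lda_gibbs(docs, K, alpha=0.1, beta=0.01, iters=200, seed=0):
    """docs: list of tokenized documents; K: number of topics.
    Returns (theta, phi, vocab): per-document topic distributions,
    per-topic word distributions, and the vocabulary."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    V = len(vocab)
    wid = {w: i for i, w in enumerate(vocab)}

    # Count matrices: doc-topic, topic-word, and topic totals.
    ndk = [[0] * K for _ in docs]
    nkw = [[0] * V for _ in range(K)]
    nk = [0] * K

    # Randomly initialize a topic assignment for every word token.
    z = []
    for d, doc in enumerate(docs):
        zd = []
        for w in doc:
            k = rng.randrange(K)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1
        z.append(zd)

    # Collapsed Gibbs sweeps: resample each token's topic from its
    # full conditional p(z = k | all other assignments).
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]
                ndk[d][k] -= 1; nkw[k][wid[w]] -= 1; nk[k] -= 1
                weights = [(ndk[d][t] + alpha)
                           * (nkw[t][wid[w]] + beta) / (nk[t] + V * beta)
                           for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[d][i] = k
                ndk[d][k] += 1; nkw[k][wid[w]] += 1; nk[k] += 1

    # Posterior-mean estimates of the document-topic distribution
    # theta and the topic-word distribution phi.
    theta = [[(ndk[d][t] + alpha) / (len(docs[d]) + K * alpha)
              for t in range(K)] for d in range(len(docs))]
    phi = [[(nkw[t][v] + beta) / (nk[t] + V * beta)
            for v in range(V)] for t in range(K)]
    return theta, phi, vocab
```

Inspecting the highest-probability entries of each row of `phi` then gives the top words per topic, which is how topic content is interpreted in the abstract above.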