Using Skip-Gram Model to Predict from which Show a Given Line is

More Info
expand_more

Abstract

Text classification has a wide range of usage such as extracting the sentiment out of a product review, analyzing the topic of a document and spam detection. In this research, the text classification task is to predict from which TV-show a given line is. The skip-gram model, originally used to train the Word2Vec sentence embeddings [Mikolov et al, 2013], is adapted to determine the likelihood of occurrence of a sentence in a TV-show. Based on this feature, a classifier is built to perform the task of this research. The results of the cross-validation show that it reaches an accuracy of 58% when running on the transcript data of 3 shows and 43% on 4 shows, while the accuracies of random guessing are supposed to be 33% and 25%. The difference between the neural networks and the skip-gram model becomes smaller when more shows are added to evaluate the model. Among each 5 fold cross-validation of the two models, the best results appear in the midmost iterations.