Enhancing Real-Time Twitter Filtering and Classification using a Semi-Automatic Dynamic Machine Learning setup approach

Abstract

Twitter contains massive amounts of user-generated content, much of which is valuable to various interested parties. Twitcident was developed to process and filter this information in real time by monitoring a set of predefined topics, exploiting humans as sensors. By analyzing the relevant information, an operator can estimate the severity of a situation and act accordingly. However, the extracted content contains a lot of irrelevant noise alongside the relevant and useful material. Our goal is to improve the filter so that the majority of the information presented by Twitcident is relevant. To this end we designed an artifact consisting of several components, developed within a dynamic framework. Its major components include a machine learning classifier operating on dynamic features, a semi-automatic setup approach, and a training approach. Our prototype operates on Dutch content, but it can be adapted to operate on any language. With a partially implemented prototype of the designed artifact we achieve F2-scores of 0.7 to 0.9 on our Dutch test sets using 10-fold cross-validation, which is on average a 30% improvement over the existing Twitcident filtering architecture. The artifact is designed robustly, allowing for many forms of future improvement and extension. We also make some side contributions, such as an approximate matching algorithm for variable-length strings.
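
For readers unfamiliar with the metric, the F2-score mentioned in the abstract is the F-beta score with beta = 2, which weights recall more heavily than precision. The following is a minimal sketch of the standard definition, not code from the thesis; the function name `f_beta` and its count-based parameters are hypothetical and chosen only for illustration.

```python
def f_beta(true_positives, false_positives, false_negatives, beta=2.0):
    """Standard F-beta score computed from raw counts.

    beta=2 gives the F2-score, which favors recall over precision.
    The name and signature are illustrative assumptions, not the thesis API.
    """
    if true_positives == 0:
        # No correct positive predictions: precision and recall are both zero.
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    beta_sq = beta * beta
    return (1 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# A filter with high recall but modest precision scores better on F2 than on F1.
f2 = f_beta(9, 9, 1, beta=2.0)
f1 = f_beta(9, 9, 1, beta=1.0)
```

Choosing beta = 2 fits a monitoring setting where missing a relevant tweet is presumably costlier than showing an operator an irrelevant one.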