Enhancing Real-Time Twitter Filtering and Classification using a Semi-Automatic Dynamic Machine Learning setup approach

Master thesis (2015)

Authors

N. De Jong

Contributors

G.J.P.M. Houben (mentor)

C. Hauff (mentor)

R.J.P. Stronkman (mentor)

Programme

Web Information Systems (TU Delft)

Machine learning, Classification, Twitter, Real-time filtering, Social filtering, Twitcident

To reference this document use:

http://resolver.tudelft.nl/uuid:1bfec308-2c16-4cd3-baf6-35062a885ad7

More Info

Published Date

31-08-2015

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Twitter contains massive amounts of user-generated content, much of it valuable to various interested parties. Twitcident has been developed to process and filter this information in real time for those parties by monitoring a set of predefined topics, exploiting humans as sensors. By analysing the relevant information, an operator can estimate the severity of a situation and act accordingly. However, the extracted content contains a lot of irrelevant noise alongside the relevant and useful content. Our goal is to improve the filter so that the majority of the information presented by Twitcident is relevant. To this end we designed an artifact consisting of several components, developed within a dynamic framework. Its major components include a machine learning classifier operating on dynamic features, a semi-automatic setup approach and a training approach. Our prototype operates on Dutch content, but it can be adapted to operate on any language. With a partially implemented prototype of our designed artifact we achieve F2-scores of 0.7 up to 0.9 on our Dutch test sets using 10-fold cross-validation, which is on average a 30% improvement over the existing Twitcident filtering architecture. The artifact is robustly designed, allowing for many forms of future improvements and extensions. We also make some side contributions, such as an approximate matching algorithm for variable-length strings.
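
For readers unfamiliar with the F2-score reported in the abstract: it is the standard F-beta measure with beta = 2, which weights recall more heavily than precision. The sketch below computes it from confusion counts. The function name and the example counts are illustrative assumptions and are not taken from the thesis.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """Compute the F-beta score from confusion counts.

    tp, fp, fn are true positives, false positives and false negatives.
    beta = 2 gives the F2-score; beta = 1 gives the usual F1-score.
    The name f_beta and its signature are hypothetical, for illustration only.
    """
    if tp == 0:
        # No true positives: precision and recall are both 0 (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Made-up counts: 80 relevant tweets kept, 40 irrelevant tweets kept,
# 10 relevant tweets wrongly dropped.
f2 = f_beta(tp=80, fp=40, fn=10)
f1 = f_beta(tp=80, fp=40, fn=10, beta=1.0)
print(f"F2 = {f2:.3f}, F1 = {f1:.3f}")
```

With these made-up counts recall is higher than precision, so the F2-score comes out above the F1-score. This reflects a preference for not missing relevant tweets over keeping every irrelevant one out.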

Files

2015-08-11_Final_Thesis_NickDe... (pdf)

(pdf | 7.51 Mb)