An Exploratory Study on Authorship Verification Models for Forensic Purpose
Z. Li
J. van den Berg – Mentor
M. Franssen – Mentor
C. Veenman – Mentor
Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.
Abstract
Authorship verification is a subfield of authorship analysis, but most research in the field addresses authorship identification, and verification has received comparatively little attention. The problem is becoming increasingly important for digital documents, as criminals and terrorist organizations exploit the anonymity of cyberspace to avoid punishment. Forensic linguistic experts therefore need effective methods to verify whether a short text was written by a suspect.
This master's thesis presents an exploratory study of authorship verification models. The research problem is as follows: given a few texts (around 1000 words each) by one author, determine whether a disputed text was written by the same author. The primary objective is to design several innovative authorship verification models that solve this problem. A second goal is to participate in the authorship verification task of the PAN Contest 2013. The thesis extensively explores the use of compression features for authorship verification. Both one-class and two-class classification models are designed. In a one-class classification model there is only a target class, and the decision is based on a predefined rule. In a two-class classification model there are both a target class and an outlier class, and the threshold is determined by learning the boundary between the two classes. In total, five models were designed and evaluated, four of which use compression features. The fifth, the Character N-Gram Model, was designed to compare character n-grams with compression features.
The first task of this project was data collection. To participate in the PAN Contest, data similar to the contest data (engineering textbooks from bookboon.com) were collected. The collected corpus contains 72 books written by 51 authors. The Book Collection Corpus was derived from the collected books and was used to develop the models. Additionally, an Enron Email Corpus was used to test the performance of one authorship verification model. The models achieved satisfactory performance and show potential for solving other similar problems.
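As an illustration of the ideas summarized in the abstract, the following minimal sketch shows how a compression feature could feed a one-class decision rule. It is not the thesis's actual model: the choice of zlib, the normalized compression distance, the averaging over known texts, the function names and the threshold value of 0.9 are all assumptions made for this example.

```python
import zlib


def compressed_size(text: str) -> int:
    """Size in bytes of the zlib-compressed UTF-8 encoding of text."""
    return len(zlib.compress(text.encode("utf-8"), 9))


def ncd(x: str, y: str) -> float:
    """Normalized compression distance between two texts.

    Values near 0 suggest the texts share much structure; values near 1
    suggest they share little. zlib is an illustrative choice of compressor.
    """
    cx, cy = compressed_size(x), compressed_size(y)
    cxy = compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)


def verify_one_class(known_texts, disputed, threshold=0.9):
    """One-class rule: accept if the mean distance to the known texts
    is at or below a predefined threshold (hypothetical value)."""
    if not known_texts:
        raise ValueError("at least one known text is required")
    distances = [ncd(known, disputed) for known in known_texts]
    mean_distance = sum(distances) / len(distances)
    return mean_distance <= threshold, mean_distance
```

In a two-class variant, the threshold would not be fixed in advance but learned from distances computed for both same-author and different-author text pairs.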