Plagiarism detection by similarity join

Abstract

Because the internet is vast and most of its content is public, it is very hard to determine where a piece of information originated. Many websites publish news articles, so people and organizations can easily lose track of where their articles are reused, with or without permission. This paper presents a plagiarism detection algorithm that quickly compares online news articles against a personal collection of news articles and detects plagiarized passages with quality comparable to that of a human. The algorithm uses a basic shingle index, together with a Signature Tree as a more advanced pre-filtering step, to narrow down the candidate documents for a query. The full algorithm achieves 0.96 precision and 0.94 recall but is too resource-intensive to be considered scalable. When only the pre-filtering step is used, it achieves 0.85 precision and 0.85 recall, with a speedup of nearly one order of magnitude.
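
To make the idea of a shingle index concrete, the following is a minimal sketch and not the authors' implementation. It assumes word-level shingles of length `k = 3`, an in-memory inverted index, and Jaccard similarity as the scoring function. The function names, the threshold value, and the choice of Jaccard similarity are all illustrative assumptions, and the Signature Tree pre-filter is not modelled here.

```python
from collections import defaultdict


def shingles(text, k=3):
    """Return the set of word-level k-shingles of text (k is an assumed value)."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single short shingle, or none.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def build_index(docs, k=3):
    """Build an inverted index mapping each shingle to the ids of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for sh in shingles(text, k):
            index[sh].add(doc_id)
    return index


def candidates(query, index, docs, k=3, threshold=0.2):
    """Return (doc_id, jaccard) pairs at or above threshold, best match first."""
    q = shingles(query, k)
    # Pre-filter: only documents sharing at least one shingle with the query are scored.
    hits = set()
    for sh in q:
        hits |= index.get(sh, set())
    scored = []
    for doc_id in hits:
        d = shingles(docs[doc_id], k)
        jaccard = len(q & d) / len(q | d)
        if jaccard >= threshold:
            scored.append((doc_id, jaccard))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this sketch, a caller would build the index once over the personal collection with `build_index(docs)` and then call `candidates(article_text, index, docs)` for each online article. Documents that share no shingle with the query are never scored, which is the kind of candidate narrowing that a pre-filtering step aims for.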