Semantic-Enhanced Training Data Augmentation Methods for Long-Tail Entity Recognition Models

doi:10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb

Doctoral Thesis (2020)

Author(s)

Sepideh Mesbah (TU Delft - Industrial Design Engineering)

Contributor(s)

Geert-Jan Houben – Promotor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Alessandro Bozzon – Promotor (TU Delft - Industrial Design Engineering)

Christoph Lofi – Copromotor (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Web Information Systems

Keywords: Long-tail, Named Entity Recognition, Semantic Enrichment, Training Data Augmentation

DOI related publication

https://doi.org/10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb Final published version

To reference this document use

https://doi.org/10.4233/uuid:dbbfe1fc-bf63-45f0-8cf2-28ed7dab90eb

More Info

Publication Year

2020

Language

English

ISBN (print)

978-94-6380-808-8

Downloads counter

484

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Named Entity Recognition (NER) is an essential information retrieval task. It enables a wide range of natural language processing applications, such as semantic search and machine translation. NER can be formulated as the task of identifying and typing words or phrases in a text that refer to certain classes of interest (e.g., diseases, adverse drug reactions). There are different techniques to tackle NER, such as dictionary-based, rule-based, and machine learning-based approaches. Machine learning-based NER techniques have been shown to perform best for entities with large amounts of human-labeled training data.
However, their performance is limited when dealing with long-tail entities. Long-tail entities are entities that occur with low frequency in document collections and usually have no reference in existing Knowledge Bases. Obtaining human-labeled datasets is expensive and time-consuming, especially for long-tail entities, which are scarce in document collections. This dissertation focuses on the lack of training data, arguably the largest bottleneck in training machine learning-based NER techniques. We investigated efficient and effective ways to augment training data by automatically enhancing its size and quality. Our work aimed to show how enhancing the size and quality of the training data using different techniques makes it possible to improve the performance of Long-tail Entity Recognition (L-tER).

Files

Sepideh_DissertationApril2020.... (pdf)

(pdf | 11.6 Mb)

License info not available