TSE-NER

An Iterative Approach for Long-Tail Entity Extraction in Scientific Publications

Conference Paper (2018)

Author(s)

Sepideh Mesbah (TU Delft - Web Information Systems)

Christoph Lofi (TU Delft - Web Information Systems)

Manuel Valle Torre (TU Delft - Web Information Systems)

Alessandro Bozzon (TU Delft - Web Information Systems)

Geert-Jan Houben (TU Delft - Web Information Systems)

Research Group

Web Information Systems

DOI related publication

https://doi.org/10.1007/978-3-030-00671-6_8

To reference this document use:

https://resolver.tudelft.nl/uuid:91b0bf60-1304-4b2f-ba55-f58f04351381

Publication Year

2018

Language

English

Bibliographical Note

Green Open Access added to TU Delft Institutional Repository ‘You share, we take care!’ – Taverne project https://www.openaccess.nl/en/you-share-we-take-care Otherwise as indicated in the copyright section: the publisher is the copyright holder of this work and the author uses the Dutch legislation to make this work public.

Pages (from-to)

127-143

Publisher

Springer

ISBN (print)

978-3-030-00670-9

ISBN (electronic)

978-3-030-00671-6

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Named Entity Recognition and Typing (NER/NET) is a challenging task, especially with long-tail entities such as the ones found in scientific publications. These entities (e.g. “WebKB”, “StatSnowball”) are rare, often relevant only in specific knowledge domains, yet important for retrieval and exploration purposes. State-of-the-art NER approaches employ supervised machine learning models, trained on expensive type-labeled data laboriously produced by human annotators. A common workaround is the generation of labeled training data from knowledge bases; this approach is not suitable for long-tail entity types that are, by definition, scarcely represented in KBs.
This paper presents an iterative approach for training NER and NET classifiers in scientific publications that relies on minimal human input, namely a small seed set of instances for the targeted entity type. We introduce different strategies for training data extraction, semantic expansion, and result entity filtering. We evaluate our approach on scientific publications, focusing on the long-tail entity types Datasets and Methods in computer science publications, and Proteins in biomedical publications.

Files

Mesbah2018_Chapter_TSE_NERAnIt... (pdf)

(pdf | 0.968 Mb)

- Embargo expired on 18-02-2019

License info not available