A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction

Journal Article (2023)
Author(s)

Charlotte Nachtegael (Vrije Universiteit Brussel)

J. De Stefani (Vrije Universiteit Brussel, TU Delft - Information and Communication Technology)

Tom Lenaerts (Vrije Universiteit Brussel)

Research Group
Information and Communication Technology
Copyright
© 2023 Charlotte Nachtegael, J. De Stefani, Tom Lenaerts
DOI related publication
https://doi.org/10.1371/journal.pone.0292356
Publication Year
2023
Language
English
Issue number
12 December
Volume number
18
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research, as it generates high-quality labelled data that can be used to develop innovative predictive methods. However, building fully labelled, high-quality bioRE data sets of adequate size for training state-of-the-art relation extraction models is hindered by an annotation bottleneck, caused by the limited time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improving bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, and evaluated on their area under the learning curve (AULC) as well as intermediate performance measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, statistically outperform the other types of AL strategies in terms of F1-score, accuracy and precision. In terms of recall, however, a diversity-based strategy called Core-set outperforms all others. AL strategies are shown to reduce the annotation needed to reach performance on par with training on all data by 6% to 38%, depending on the data set, with Margin Sampling and Least-Confident Sampling moreover obtaining the best AULCs compared to the Random Sampling baseline. These experiments demonstrate the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets that lead to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
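The two uncertainty-based query strategies named in the abstract can be sketched as follows; this is a minimal NumPy illustration of the general technique, not the authors' implementation, and the function names and toy probabilities are assumptions for the example.

```python
import numpy as np

def least_confident(probs: np.ndarray, k: int) -> np.ndarray:
    """Least-Confident Sampling: select the k pool instances whose
    top predicted class probability is lowest."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k]

def margin_sampling(probs: np.ndarray, k: int) -> np.ndarray:
    """Margin Sampling: select the k pool instances with the smallest
    gap between the two most probable classes."""
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margin)[:k]

# Toy unlabelled pool: 4 candidate relation mentions, 3 relation classes.
probs = np.array([
    [0.90, 0.05, 0.05],  # model is confident
    [0.40, 0.35, 0.25],  # low confidence, small margin
    [0.50, 0.45, 0.05],  # small margin between top two classes
    [0.80, 0.10, 0.10],
])

# Both strategies would send the two most uncertain instances to annotators.
print(least_confident(probs, k=2))
print(margin_sampling(probs, k=2))
```

In an AL loop, the selected instances are labelled, added to the training set, the model is retrained, and the pool probabilities are recomputed before the next query round.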