A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction

Journal Article (2023)
Author(s)

Charlotte Nachtegael (Vrije Universiteit Brussel)

J. De Stefani (Vrije Universiteit Brussel, TU Delft - Information and Communication Technology)

Tom Lenaerts (Vrije Universiteit Brussel)

Research Group
Information and Communication Technology
Copyright
© 2023 Charlotte Nachtegael, J. De Stefani, Tom Lenaerts
DOI related publication
https://doi.org/10.1371/journal.pone.0292356
Publication Year
2023
Language
English
Issue number
12 December
Volume number
18
Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research, as it generates high-quality labelled data that can be used to develop innovative predictive methods. However, building fully labelled, high-quality bioRE data sets of adequate size for training state-of-the-art relation extraction models is hindered by an annotation bottleneck, caused by the limited time and expertise of researchers and curators. We show here how Active Learning (AL) plays an important role in resolving this issue and improving bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model, and evaluated on their area under the learning curve (AULC) as well as intermediate performance measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, statistically outperform the other types of AL strategies in terms of F1-score, accuracy and precision. In terms of recall, however, a diversity-based strategy called Core-set outperforms all others. AL strategies are shown to reduce the annotation needed to reach performance on par with training on all data by 6% to 38%, depending on the data set, with Margin Sampling and Least-Confident Sampling moreover obtaining the best AULCs compared to the Random Sampling baseline. These experiments demonstrate the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets that lead to optimal performance of deep learning models. The code and data sets to reproduce all the results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
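The two uncertainty-based query strategies named in the abstract can be sketched as follows; this is a minimal NumPy illustration of the general technique, not the authors' implementation, and the function names and toy probabilities are assumptions for the example.

```python
import numpy as np

def least_confident(probs: np.ndarray, k: int) -> np.ndarray:
    """Least-Confident Sampling: select the k pool instances whose
    top predicted class probability is lowest."""
    confidence = probs.max(axis=1)
    return np.argsort(confidence)[:k]

def margin_sampling(probs: np.ndarray, k: int) -> np.ndarray:
    """Margin Sampling: select the k pool instances with the smallest
    gap between the two most probable classes."""
    sorted_probs = np.sort(probs, axis=1)
    margin = sorted_probs[:, -1] - sorted_probs[:, -2]
    return np.argsort(margin)[:k]

# Toy unlabelled pool: 4 candidate relation mentions, 3 relation classes.
probs = np.array([
    [0.90, 0.05, 0.05],  # model is confident
    [0.40, 0.35, 0.25],  # low confidence, small margin
    [0.50, 0.45, 0.05],  # small margin between top two classes
    [0.80, 0.10, 0.10],
])

# Both strategies would send the two most uncertain instances to annotators.
print(least_confident(probs, k=2))
print(margin_sampling(probs, k=2))
```

In an AL loop, the selected instances are labelled, added to the training set, the model is retrained, and the pool probabilities are recomputed before the next query round.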