Evaluating Neural Text Simplification in the Medical Domain

Abstract

Health literacy, i.e. the ability to read and understand medical text, is a relevant component of public health. Unfortunately, many medical texts are hard for the general population to grasp, as they are targeted at highly skilled health professionals and use complex language and domain-specific terms. Automatic text simplification, which rewrites text to be broadly understandable, would therefore be very beneficial. In this thesis we evaluate the state of the art in automatic text simplification in the medical domain. We train a Neural Machine Translation (NMT) system on aligned complex and simple sentences from Wikipedia and Simple Wikipedia. As there are no publicly available aligned medical text simplification corpora, we create one semi-automatically with the help of a domain expert, and one fully automatically using a novel monolingual alignment method introduced in this thesis. We analyse the effect of in-domain data when training an NMT system. Furthermore, we describe two strategies for medical term simplification in combination with NMT: 1) an extra pre-processing step that boosts medical term simplification, and 2) a post-processing dictionary approach using the Open-Access and Collaborative Consumer Health Vocabulary (CHV). We analyse the effect of each strategy separately. We have human judges evaluate the output on grammaticality, meaning preservation (relative to the complex sentence), and simplicity (compared to the complex sentence).

Results show that an NMT system trained on general aligned complex and simple sentences is able to simplify medical sentences to the level of Simple Wikipedia. An NMT system trained on medical sentences (in addition to general sentences), combined with the boosting strategy for medical term simplification, is able to translate more medical concepts, but its output is not simpler than that of the NMT system trained on general sentences only. Interestingly, NMT in combination with the CHV did not increase simplicity, but had the opposite effect.