Effective crowdsourced generation of training data for chatbots natural language understanding

Conference Paper (2018)

Author(s)

Rucha Bapat (Student TU Delft)

Pavel Kucherbaev (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Alessandro Bozzon (TU Delft - Electrical Engineering, Mathematics and Computer Science)

Research Group

Web Information Systems

Crowdsourcing, Conversational agents, Natural language understanding

DOI related publication

https://doi.org/10.1007/978-3-319-91662-0_8 Final published version

To reference this document use

https://resolver.tudelft.nl/uuid:93fdad6e-7d97-4441-b3ca-a1e9dde637d7

Publication Year

2018

Language

English

Bibliographical Note

Accepted Author Manuscript

Pages (from-to)

114-128

Publisher

Springer

ISBN (print)

978-3-319-91661-3

ISBN (electronic)

978-3-319-91662-0

Event

18th International Conference on Web Engineering, ICWE 2018 (2018-06-05 - 2018-06-08), Caceres, Spain

Downloads counter

331

Collections

Institutional Repository

Reuse Rights

Other than for strictly personal use, it is not permitted to download, forward or distribute the text or part of it, without the consent of the author(s) and/or copyright holder(s), unless the work is under an open content license such as Creative Commons.

Abstract

Chatbots are text-based conversational agents. Natural Language Understanding (NLU) models are used to extract meaning and intention from the messages users send to chatbots. The user experience of a chatbot largely depends on the performance of its NLU model, which in turn largely depends on the initial dataset the model is trained on. The training data should cover the diversity of real user requests the chatbot will receive. Obtaining such data is a challenging task even for big corporations. We introduce a generic approach to generating training data with the help of crowd workers, and we discuss the workflow of the approach and the design of crowdsourcing tasks that assure high quality. We evaluate the approach by running an experiment that collects data for 9 different intents. We use the collected training data to train a natural language understanding model and analyse the performance of the model under different training set sizes for each intent. We provide recommendations on selecting an optimal confidence threshold for predicting intents, based on a cost model of incorrect and unknown predictions.
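
The threshold recommendation mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's actual cost model: it assumes that each prediction below the threshold is treated as "unknown" at a fixed cost, and each accepted but wrong prediction incurs a fixed "incorrect" cost. The function names, the cost values, and the sample predictions are all hypothetical.

```python
from typing import List, Optional, Tuple

# Each prediction is (confidence, is_correct). The data is hypothetical.
Prediction = Tuple[float, bool]


def total_cost(preds: List[Prediction], threshold: float,
               cost_incorrect: float, cost_unknown: float) -> float:
    """Sum the assumed costs of unknown and incorrect predictions."""
    cost = 0.0
    for confidence, is_correct in preds:
        if confidence < threshold:
            # Below the threshold the intent is reported as unknown.
            cost += cost_unknown
        elif not is_correct:
            # Accepted prediction that turned out to be wrong.
            cost += cost_incorrect
    return cost


def best_threshold(preds: List[Prediction], cost_incorrect: float,
                   cost_unknown: float,
                   candidates: Optional[List[float]] = None) -> float:
    """Return the candidate threshold with the lowest total cost.

    Ties are resolved in favour of the lowest threshold.
    """
    if candidates is None:
        # Assumed grid: 0.00, 0.05, ..., 1.00.
        candidates = [i / 20 for i in range(21)]
    return min(candidates,
               key=lambda t: total_cost(preds, t, cost_incorrect, cost_unknown))


sample = [(0.9, True), (0.8, True), (0.6, False), (0.4, False), (0.3, True)]
strict = best_threshold(sample, cost_incorrect=5.0, cost_unknown=1.0)
lenient = best_threshold(sample, cost_incorrect=1.0, cost_unknown=5.0)
```

Under these assumptions, a high cost for incorrect predictions pushes the chosen threshold up, while a high cost for unknown predictions pushes it down.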

Files

Paper61_ICWE2018_NLU.pdf

(pdf | 0.435 Mb)

License info not available