Building Interactive Text-to-SQL Systems

Abstract

Natural Language Interfaces for Databases (NLIDBs) offer users a way to reason about data without knowing its structure or relations, and without familiarity with a query language such as SQL; natural language alone suffices. This thesis focuses on a subset of NLIDBs: those that take 'plain English' sentences as input and produce SQL queries as output.

Study 1 recruits participants from multiple origins (academia, a crowdsourcing platform, and the banking industry) without selecting on query-language capabilities. Participants are then segmented by query-language capability to distinguish non-experts from experts. Because SQL is a common way to retrieve information from databases, knowledge of SQL serves as a proxy for participants' skill level (SQL proficient vs. non-SQL proficient). We create an approach that automatically evaluates user-generated queries for near semantic equivalence against a predefined gold-standard SQL query, and segment participants accordingly: 70 out of 242 participants are identified as SQL proficient. To compare the segments, we define 42 requirements often implemented in NLIDB systems, from which each segment selects its preferred requirements. We find no statistically significant differences between the segments' preferences. However, exploratory findings reveal the importance of origin: the banking-industry segment, unlike the others, prefers explanation over answer accuracy.
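One common way to automate a near-semantic-equivalence check is execution-based: run both the user's query and the gold-standard query against the same database and compare their result sets while ignoring row and column order. The sketch below illustrates that idea; the `accounts` schema and data are hypothetical, and the thesis's actual evaluation procedure may differ.

```python
import sqlite3

def near_equivalent(user_sql: str, gold_sql: str) -> bool:
    """Judge two SQL queries as near semantically equivalent if they
    return the same rows on the same database, ignoring row order and
    column order. Execution-based checking is an approximation: it can
    conflate queries that merely coincide on this particular data."""
    conn = sqlite3.connect(":memory:")
    try:
        # Hypothetical example schema and data, for illustration only.
        conn.executescript("""
            CREATE TABLE accounts (id INTEGER, owner TEXT, balance REAL);
            INSERT INTO accounts VALUES (1, 'alice', 100.0), (2, 'bob', 250.0);
        """)

        def result_set(sql: str):
            rows = conn.execute(sql).fetchall()
            # Sort values within each row to ignore column order,
            # then sort the rows to ignore row order.
            return sorted(tuple(sorted(map(repr, row))) for row in rows)

        return result_set(user_sql) == result_set(gold_sql)
    finally:
        conn.close()

# Same rows, different column order and predicate phrasing: equivalent.
print(near_equivalent(
    "SELECT owner, balance FROM accounts WHERE balance > 150",
    "SELECT balance, owner FROM accounts WHERE balance >= 151",
))
```

A production evaluator would likely run both queries over several databases (or randomized data) to reduce coincidental matches, but the comparison logic stays the same.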

Study 2 is inspired by the exploratory findings of Study 1 and uses its requirements to build an application that tests two conditions: one with an explanation via color-coding (showing the relations between the natural language question and the model's output columns) and one without. Because NLIDBs make it hard for users to verify whether the model's answer is correct, Study 2 uses these two conditions to test whether color-coding improves participants' performance. Our findings suggest that color-coding improves performance only for non-aggregate selection queries with multiple columns.
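The color-coding idea can be sketched as assigning one color per output column and reusing that color for the question words the model linked to it. This is a minimal illustration, not the study's application: the `column_links` mapping (column name to associated question words) is an assumed format for the model's alignment output.

```python
# Hypothetical sketch: render a question with HTML <span> color-coding
# so each output column and its linked question words share a color.
COLORS = ["#e6194b", "#3cb44b", "#4363d8"]  # arbitrary palette

def color_code(question: str, column_links: dict) -> str:
    """column_links maps each output column to the question words the
    model associated with it (an assumed alignment format)."""
    word_color = {}
    header = []
    for i, (column, words) in enumerate(column_links.items()):
        color = COLORS[i % len(COLORS)]
        header.append(f'<span style="color:{color}">{column}</span>')
        for w in words:
            word_color[w.lower()] = color
    # Color each question word that the model linked to a column.
    body = " ".join(
        f'<span style="color:{word_color[w.lower()]}">{w}</span>'
        if w.lower() in word_color else w
        for w in question.split()
    )
    return body + "<br>" + " | ".join(header)

html = color_code(
    "Show the balance of every account owner",
    {"owner": ["owner"], "balance": ["balance"]},
)
```

The point of the visualization is verification: a user can see at a glance which parts of their question the system grounded in which result columns, and spot a mismatch before trusting the answer.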