Instruction Tuning for Domain Adaptation of Large Language Models
A Case Study in the Field of Education
Abstract
While most large language models (LLMs) are powerful, they are primarily designed for general purposes. Consequently, many enterprises and institutions have turned to developing domain-specific models. In education, an expert LLM can significantly enhance students' ability to find information effectively and reach their learning goals. Nevertheless, the training of such expert models for education remains largely unexplored. This study addresses this research gap by developing a framework to transform semi-structured educational web data into structured datasets and perform instruction tuning on foundation models. Additionally, we conduct a comprehensive performance analysis to determine how various training factors affect model performance.
We first employed a systematic and cost-effective approach involving web data extraction, data cleaning, validation, task design based on student surveys, and automated instruction-instance generation using LLMs. Human evaluations confirmed the quality of these datasets, particularly their relevance and accuracy.
This study then investigates the impact of various training techniques on the performance of domain-specific educational LLMs. Our experiments reveal that further pre-training enhances model performance, especially on domain-specific terminology, although the performance gains diminish as the dataset size increases. Multi-task training also improves model relevance, accuracy, and clarity, but less correlated tasks and datasets can present challenges: increased complexity and potential performance degradation as the model switches between diverse tasks. Lastly, this study conducts a comparative analysis of different models, highlighting trade-offs between computational resources and performance.
The findings demonstrate that a structured approach to dataset generation and strategic training can effectively produce domain-specific LLMs for education. This research advances the development of educational LLMs and provides a foundation for future researchers to build more specialized models in other domains.