Abstract
Data augmentation is a widely used strategy to enhance the predictive power of machine learning (ML) models. One such application is intent classification, where the end goal of an utterance must be categorized using text mining techniques. However, recent augmentation methods based on general-purpose, off-the-shelf Large Language Models (LLMs) leave room for improvement, as they can struggle to capture the nuances of domain-specific scenarios. This paper presents a novel utterance augmentation method that combines LLMs and word embedding models to address this issue, particularly in domain-specific problems. The proposed method starts from a given reference set. First, paraphrases are generated using LLMs to obtain new utterances that are semantically similar to the original ones but differ in word choice and syntax. Next, synonym replacement is performed using a previously trained domain-specific word embedding model. This step incorporates vocabulary relevant to a particular topic into the final augmented dataset, capturing the nuances of domain-specific problems. Experiments were conducted to assess the quality of the proposal in two intent classification problems related to financial trading compliance. In the first problem, comprising eight reference utterances and 375 challenging examples in general language, the proposal has an advantage over many well-known approaches. In the second problem, with 13 reference utterances and 525 challenging examples in general and financial trading vocabulary, the proposal outperforms the best of the analyzed state-of-the-art methods by up to 15%.
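The two-stage pipeline described in the abstract (LLM paraphrasing followed by embedding-based synonym replacement) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names, the toy embedding table, and the 0.95 cosine-similarity threshold are all hypothetical, and the paraphrasing step is a stub that returns its input unchanged rather than calling an LLM.

```python
import math

# Hypothetical toy vectors standing in for a trained domain-specific
# word embedding model (assumption: a real model would be loaded instead).
TOY_EMBEDDINGS = {
    "buy": [0.9, 0.1, 0.0],
    "purchase": [0.85, 0.15, 0.05],
    "sell": [-0.9, 0.1, 0.0],
    "shares": [0.0, 0.9, 0.1],
    "stock": [0.05, 0.88, 0.12],
    "today": [0.0, 0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def nearest_neighbor(word, embeddings, threshold=0.95):
    """Return the most similar other word above the threshold, or None."""
    if word not in embeddings:
        return None
    best, best_sim = None, threshold
    for candidate, vector in embeddings.items():
        if candidate == word:
            continue
        sim = cosine(embeddings[word], vector)
        if sim > best_sim:
            best, best_sim = candidate, sim
    return best


def paraphrase_stub(utterance):
    """Placeholder for the LLM paraphrasing step (assumption: no LLM call here)."""
    return [utterance]


def synonym_variants(utterance, embeddings, threshold=0.95):
    """Create one variant per token that has a close embedding neighbor."""
    tokens = utterance.lower().split()
    variants = []
    for i, token in enumerate(tokens):
        substitute = nearest_neighbor(token, embeddings, threshold)
        if substitute is not None:
            variants.append(" ".join(tokens[:i] + [substitute] + tokens[i + 1:]))
    return variants


def augment(reference_set, embeddings, threshold=0.95):
    """Paraphrase each reference utterance, then add synonym-replaced variants."""
    augmented = []
    for utterance in reference_set:
        for paraphrase in paraphrase_stub(utterance):
            augmented.append(paraphrase)
            augmented.extend(synonym_variants(paraphrase, embeddings, threshold))
    return augmented
```

With the toy table, `augment(["buy shares today"], TOY_EMBEDDINGS)` keeps the original utterance and adds one variant for each token that has a sufficiently close neighbor. In practice, the stub would be replaced by a call to an LLM and the table by a model trained on domain text, such as financial trading vocabulary.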
Publisher
Springer
Description
This research has been supported by VoxSmart Trading S.L., and grants from Madrid Autonomous Community (Ref: IND2023/TIC-28393) and the Spanish Ministry of Science and Innovation, under the Knowledge Generation Projects program: XMIDAS (Ref: PID2021-122640OB-100).
Citation
Madrueño, N., Fernández-Isabel, A., Cuesta, M. et al. Novel utterance data augmentation for intent classification using large language models. Neural Comput & Applic (2025). https://doi.org/10.1007/s00521-025-11642-3