Research on Vocabulary Optimization of Multilingual Models for Low-Resource Languages
Abstract
To enhance the performance of multilingual models on low-resource languages, particularly in downstream tasks such as sentiment analysis, a framework for vocabulary expansion is proposed. The framework selects low-frequency but informative words using Zipf’s Law and refines the vocabulary with weighted entropy analysis. Experimental results show improvements in accuracy and macro F1 of 3.85% and 5.22%, respectively, notably on Hindi product reviews and Hindi–English code-switched text. The study also notes limitations, including performance fluctuations at intermediate stages of vocabulary expansion and the need to further explore the strategy’s applicability to other NLP tasks. Despite these issues, the proposed framework offers a valuable method for improving the representation of low-resource languages in multilingual models.
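The abstract does not specify the selection procedure in detail, but one plausible reading of "low-frequency but informative words selected via Zipf's Law and weighted entropy" can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `rank_cutoff`, `top_k`, and the frequency-weighted document-entropy score are all assumptions introduced here.

```python
import math
from collections import Counter

def zipf_entropy_candidates(docs, rank_cutoff=1000, top_k=50):
    """Pick low-frequency but informative words for vocabulary expansion.

    Hypothetical sketch: 'low-frequency' is read as a high Zipf rank
    (beyond rank_cutoff), and 'informative' as a high frequency-weighted
    entropy of the word's spread across documents.
    """
    # Global frequency and per-document counts
    total = Counter()
    per_doc = []
    for doc in docs:
        counts = Counter(doc.split())
        per_doc.append(counts)
        total.update(counts)

    # Zipf rank: 1 = most frequent word in the corpus
    rank = {w: i + 1 for i, (w, _) in enumerate(total.most_common())}

    scores = {}
    for w, f in total.items():
        if rank[w] <= rank_cutoff:
            continue  # skip high-frequency words the vocabulary already covers
        # Entropy of the word's occurrence distribution over documents
        probs = [c[w] / f for c in per_doc if c[w] > 0]
        entropy = -sum(p * math.log2(p) for p in probs)
        # Weight entropy by log frequency so one-off words do not dominate
        scores[w] = entropy * math.log2(1 + f)

    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Words surviving this filter would then be appended to the tokenizer vocabulary, with the model's embedding matrix resized accordingly.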