Research on Vocabulary Optimization of Multilingual Models for Low-Resource Languages
Zhenghang Tang

Abstract

To enhance the performance of multilingual models on low-resource languages, particularly in downstream tasks such as sentiment analysis, a framework for vocabulary expansion is proposed. This framework selects low-frequency but informative words using Zipf’s Law and optimizes the vocabulary with weighted entropy analysis. Experimental results show improvements of 3.85% in accuracy and 5.22% in macro F1, most notably on Hindi product reviews and Hindi–English code-switched text. However, the study notes limitations, including performance fluctuations in intermediate stages of vocabulary expansion and the need to further explore the strategy’s applicability to other NLP tasks. Despite these issues, the proposed framework provides a valuable method for enhancing the representation of low-resource languages in multilingual models.
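The abstract's two-step idea — keep rare-but-recurring tokens from the low-frequency tail of the Zipf distribution, then score them by how evenly (entropically) they spread across documents — can be sketched as follows. This is a minimal illustration of one plausible reading, not the paper's implementation; the thresholds (`min_count`, `max_rank_fraction`) and the per-document weighting scheme are assumptions.

```python
import math
from collections import Counter

def select_candidates(corpus_tokens, min_count=2, max_rank_fraction=0.5):
    """Keep tokens from the low-frequency tail of the Zipf curve
    (frequency roughly proportional to 1/rank) that still occur at
    least min_count times, i.e. rare but not one-off noise."""
    ranked = Counter(corpus_tokens).most_common()   # sorted by frequency
    cutoff = int(len(ranked) * max_rank_fraction)
    tail = ranked[cutoff:]                          # low-frequency tail
    return [tok for tok, c in tail if c >= min_count]

def weighted_entropy(token, docs, weights=None):
    """Shannon entropy of a token's (optionally weighted) distribution
    across documents; higher entropy means the token is spread more
    evenly, suggesting it is broadly informative rather than incidental."""
    if weights is None:
        weights = [1.0] * len(docs)
    counts = [w * doc.count(token) for doc, w in zip(docs, weights)]
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            entropy -= p * math.log2(p)
    return entropy
```

A token appearing equally in two documents scores 1 bit of entropy, while a token confined to a single document scores 0; ranking candidates by this score would then guide which subwords are added to the expanded vocabulary.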