Research on Vocabulary Optimization of Multilingual Models for Low-Resource Languages
Abstract
To enhance the performance of multilingual models on low-resource languages, particularly in downstream tasks such as sentiment analysis, a framework for vocabulary expansion is proposed. The framework selects low-frequency but informative words using Zipf’s Law and refines the vocabulary with weighted entropy analysis. Experimental results show improvements in accuracy and macro F1 of 3.85% and 5.22%, respectively, notably on Hindi product reviews and Hindi–English code-switched text. The study also notes limitations, including performance fluctuations at intermediate stages of vocabulary expansion and the need to further explore the strategy’s applicability to other NLP tasks. Despite these issues, the proposed framework offers a valuable method for improving the representation of low-resource languages in multilingual models.
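The abstract does not specify the selection procedure in detail, but one plausible reading of "low-frequency but informative words selected via Zipf's Law and weighted entropy" can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: `rank_cutoff`, `top_k`, and the frequency-weighted document-entropy score are all assumptions introduced here.

```python
import math
from collections import Counter

def zipf_entropy_candidates(docs, rank_cutoff=1000, top_k=50):
    """Pick low-frequency but informative words for vocabulary expansion.

    Hypothetical sketch: 'low-frequency' is read as a high Zipf rank
    (beyond rank_cutoff), and 'informative' as a high frequency-weighted
    entropy of the word's spread across documents.
    """
    # Global frequency and per-document counts
    total = Counter()
    per_doc = []
    for doc in docs:
        counts = Counter(doc.split())
        per_doc.append(counts)
        total.update(counts)

    # Zipf rank: 1 = most frequent word in the corpus
    rank = {w: i + 1 for i, (w, _) in enumerate(total.most_common())}

    scores = {}
    for w, f in total.items():
        if rank[w] <= rank_cutoff:
            continue  # skip high-frequency words the vocabulary already covers
        # Entropy of the word's occurrence distribution over documents
        probs = [c[w] / f for c in per_doc if c[w] > 0]
        entropy = -sum(p * math.log2(p) for p in probs)
        # Weight entropy by log frequency so one-off words do not dominate
        scores[w] = entropy * math.log2(1 + f)

    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```

Words surviving this filter would then be appended to the tokenizer vocabulary, with the model's embedding matrix resized accordingly.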