Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Learning language embeddings from model entropy to recover typological structure.

Mentees (4):
Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady

Project Proposal

The problem

Linguistic typology is traditionally built from manual documentation and expert analysis.

LLMs, however, learn language behavior at scale. The team asked whether the language knowledge internal to LLMs reflects typological relationships similar to those recorded in resources such as Ethnologue and Glottolog.

The approach

The team derived language representations from cross-lingual language modeling entropy: how uncertain each monolingual language model is when it predicts text in other languages.

They built Entropy2Vec, a language embedding space where distances between vectors align with typological structure, using model prediction patterns rather than handcrafted linguistic features.
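The idea can be illustrated with a minimal sketch. This is not the team's implementation. It assumes that each language is represented by the vector of average entropies that a set of monolingual models assign to text in that language, and that Euclidean distance compares those vectors. The function names, the language codes, and all probabilities are hypothetical, made-up toy values.

```python
import math

def mean_entropy(token_probs):
    """Average negative log-probability (in nats) assigned to the observed tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)

def entropy_vectors(probs_by_model, languages):
    """Map each language to a vector of entropies, one entry per monolingual model.

    probs_by_model[m][l] holds the (hypothetical) per-token probabilities that
    the model trained on language m assigns to held-out text in language l.
    """
    return {
        lang: [mean_entropy(probs_by_model[m][lang]) for m in languages]
        for lang in languages
    }

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Toy, invented probabilities for three languages (ISO 639-3 codes used as labels).
languages = ["ind", "zsm", "tha"]
toy_probs = {
    "ind": {"ind": [0.5, 0.4], "zsm": [0.3, 0.25], "tha": [0.02, 0.01]},
    "zsm": {"ind": [0.3, 0.2], "zsm": [0.5, 0.45], "tha": [0.02, 0.02]},
    "tha": {"ind": [0.01, 0.02], "zsm": [0.02, 0.01], "tha": [0.4, 0.5]},
}

vectors = entropy_vectors(toy_probs, languages)
d_ind_zsm = euclidean(vectors["ind"], vectors["zsm"])
d_ind_tha = euclidean(vectors["ind"], vectors["tha"])
print(f"ind-zsm distance: {d_ind_zsm:.2f}")
print(f"ind-tha distance: {d_ind_tha:.2f}")
```

With these toy numbers, the two languages whose models predict each other well end up closer together than the pair whose models do not. That is the kind of distance structure the embedding space is meant to capture.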

Why this matters

The learned embeddings can recover typological structure and help regularize fine-tuning to improve transfer to unseen languages.

For Southeast Asian (SEA) languages, this makes it possible to reuse existing models to infer typological structure and to improve generalization when data is limited.