CoRaL: Contextual Relevance and Linguistic Enrichment

A multi-dimensional data curation framework to balance quality, relevance, and cultural coverage in low-resource corpora.

Mentors:
Fajri Koto, M. Dehan Al-Kautsar
Mentees (3):
Thanh-Nhi Nguyen, Feliks Victor Parningotan Samosir, Michael Christlambert Sinanta

Project Proposal

Low-resource language corpora often suffer from noise, domain imbalance, and linguistic mixing, so naive filtering can shrink an already scarce corpus and erode its cultural representation.

We present CoRaL, a multi-dimensional framework for selective, context-aware data curation. CoRaL scores each text sample along three dimensions (linguistic purity, topical relevance, and cross-lingual interference) and then applies a tiered strategy: retain high-quality data; repair mid-quality samples via self-denoising, contextual regeneration, and human-in-the-loop correction; and discard irreparably noisy text.
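As a rough illustration of the tiered strategy, the sketch below routes a sample to retain, repair, or discard based on its three scores. The proposal does not specify score ranges, an aggregation rule, or thresholds, so everything here is an assumption made for illustration: scores are taken to lie in [0, 1], interference is inverted so that higher is always better, the weakest dimension decides the tier, and the names `Scores`, `assign_tier`, `retain_at`, and `discard_below` are hypothetical.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    RETAIN = "retain"    # high quality: keep as-is
    REPAIR = "repair"    # mid quality: send to denoising / regeneration / human review
    DISCARD = "discard"  # irreparably noisy: drop


@dataclass
class Scores:
    # All scores are assumed to lie in [0, 1]; the proposal does not fix a scale.
    linguistic_purity: float           # higher = cleaner target-language text
    topical_relevance: float           # higher = more on-topic
    cross_lingual_interference: float  # higher = more interference (worse)


def assign_tier(s: Scores, retain_at: float = 0.8, discard_below: float = 0.4) -> Tier:
    """Route a sample by its weakest dimension (an assumed aggregation rule)."""
    # Invert interference so every dimension reads "higher is better".
    dims = (
        s.linguistic_purity,
        s.topical_relevance,
        1.0 - s.cross_lingual_interference,
    )
    quality = min(dims)
    if quality >= retain_at:
        return Tier.RETAIN
    if quality < discard_below:
        return Tier.DISCARD
    return Tier.REPAIR


if __name__ == "__main__":
    for sample in (
        Scores(0.95, 0.90, 0.05),
        Scores(0.60, 0.90, 0.10),
        Scores(0.20, 0.90, 0.10),
    ):
        print(sample, "->", assign_tier(sample).value)
```

Taking the minimum is a deliberately conservative choice: a sample that is fluent but off-topic still lands in the repair tier rather than being retained. A weighted combination of the dimensions would be an equally plausible alternative.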

Additionally, CoRaL expands coverage by integrating high-resource language data that is culturally or thematically aligned with the target community.

Relevant publications:

  • Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models