CoRaL: Contextual Relevance and Linguistic Enrichment
A multi-dimensional data curation framework to balance quality, relevance, and cultural coverage in low-resource corpora.
Project Proposal
Low-resource language corpora often suffer from noise, domain imbalance, and linguistic mixing, so naive filtering both shrinks already scarce data and weakens cultural representation.
We present CoRaL, a multi-dimensional framework for selective, context-aware data curation. CoRaL scores each text sample along three dimensions (linguistic purity, topical relevance, and cross-lingual interference), then applies a tiered strategy: retain high-quality data; repair mid-quality samples via self-denoising, contextual regeneration, and human-in-the-loop correction; and discard irreparably noisy text.
Additionally, CoRaL expands coverage by integrating high-resource language data that is culturally or thematically aligned with the target community.
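The tiered retain/repair/discard decision described above could be sketched roughly as follows. This is a minimal illustration, not the project's implementation: the names `SampleScores`, `Tier`, and `assign_tier`, the [0, 1] score range, the "weakest dimension decides" aggregation, and the thresholds 0.8 and 0.4 are all assumptions made for the example. The scorers themselves and the repair steps are out of scope here.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    RETAIN = "retain"    # keep as-is
    REPAIR = "repair"    # route to denoising / regeneration / human correction
    DISCARD = "discard"  # irreparably noisy


@dataclass(frozen=True)
class SampleScores:
    # Hypothetical per-sample scores, each assumed to lie in [0, 1].
    purity: float        # linguistic purity (higher is better)
    relevance: float     # topical relevance (higher is better)
    interference: float  # cross-lingual interference (lower is better)


def assign_tier(scores: SampleScores,
                retain_at: float = 0.8,
                repair_at: float = 0.4) -> Tier:
    """Map a sample's scores to a tier.

    Assumed aggregation: the weakest dimension decides, with interference
    inverted so that higher always means better. Thresholds are placeholders.
    """
    quality = min(scores.purity, scores.relevance, 1.0 - scores.interference)
    if quality >= retain_at:
        return Tier.RETAIN
    if quality >= repair_at:
        return Tier.REPAIR
    return Tier.DISCARD


if __name__ == "__main__":
    clean = SampleScores(purity=0.9, relevance=0.85, interference=0.1)
    off_topic = SampleScores(purity=0.9, relevance=0.5, interference=0.2)
    noisy = SampleScores(purity=0.2, relevance=0.9, interference=0.1)
    for sample in (clean, off_topic, noisy):
        print(sample, "->", assign_tier(sample).value)
```

Taking the minimum across dimensions is only one possible choice; a weighted or learned combination of the scores would fit the same interface.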
Relevant publications:
- Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models