Knowledge Distillation in Multilingual Vision-Text Models

Distill compact multilingual vision-text embeddings from large multimodal teachers for real-world deployment.

Mentees (4):
Ashvanth S, Faiz Assabil Firdaus, Ilma Aliya Fiddien, Puja Ahmad Habibi

Project Proposal

We propose a training framework to distill a small vision-text embedding model from a large multimodal teacher. Existing knowledge distillation (KD) approaches often assume a base-sized teacher and focus on monolingual settings, leaving large teachers and multilingual scenarios underexplored.

This project will design a KD framework that works with large-scale teacher models and targets multilingual vision-text students. The resulting model should be compact and efficient enough for real-world deployment, including on edge devices.
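To make the idea concrete, here is a minimal sketch of one possible distillation objective. The proposal does not specify a loss, so this is an assumption for illustration only: it matches the pairwise cosine-similarity structure of a batch between student and teacher, which does not require the two models to share an embedding dimension. The function names (`l2_normalize`, `similarity_matrix`, `similarity_distillation_loss`) are hypothetical, and a real framework would use a tensor library rather than plain Python lists.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def similarity_matrix(embeddings):
    """Pairwise cosine similarities between all embeddings in a batch."""
    normed = [l2_normalize(e) for e in embeddings]
    return [[sum(a * b for a, b in zip(u, v)) for v in normed] for u in normed]


def similarity_distillation_loss(student_embeddings, teacher_embeddings):
    """Mean squared error between student and teacher similarity matrices.

    Assumed objective (not taken from the proposal): the student is trained
    so that items in a batch relate to each other the same way they do under
    the teacher. Student and teacher dimensions may differ.
    """
    n = len(student_embeddings)
    if n == 0 or n != len(teacher_embeddings):
        raise ValueError("batches must be non-empty and of equal size")
    s = similarity_matrix(student_embeddings)
    t = similarity_matrix(teacher_embeddings)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)


# Toy batch: a 4-dimensional teacher and a 2-dimensional student.
teacher = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
student_good = [[2.0, 0.0], [0.0, 3.0]]   # same relative geometry as the teacher
student_bad = [[1.0, 0.0], [1.0, 0.0]]    # collapses both items to one point

loss_good = similarity_distillation_loss(student_good, teacher)
loss_bad = similarity_distillation_loss(student_bad, teacher)
```

In this toy batch the student that preserves the teacher's geometry gets a lower loss than the collapsed one. In a multilingual setting, the batch could mix images with captions in several languages, so that the student inherits the teacher's cross-lingual and cross-modal similarity structure; that choice is likewise an assumption, not something the proposal states.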

Relevant publications:

  • MIEB: Massive Image Embedding Benchmark