Paper: ACL 2025, arXiv
HuggingFace Collection: Link
We're excited to present a major milestone from the SEACrowd team: the launch of SEA-VL, the largest open-source vision-language (VL) dataset specifically designed to represent the cultural diversity of Southeast Asia 🇧🇳🇰🇭🇹🇱🇮🇩🇱🇦🇲🇾🇲🇲🇵🇭🇸🇬🇹🇭🇻🇳.
Why SEA-VL?
Most vision-language datasets today reflect Western-centric imagery and language, leaving Southeast Asian cultures underrepresented and misinterpreted. SEA-VL is our open-source initiative to change that. It is designed to better represent the languages, traditions, and everyday realities of Southeast Asian communities.
Highlights of the dataset include:
- 1.3 million culturally relevant image-text pairs
- Covers all 11 Southeast Asian countries
- 50× larger than any previous SEA-focused VL dataset
- Hosted on Hugging Face: Explore SEA-VL
 
How We Built SEA-VL
We combined several approaches to balance scale with cultural fidelity:
- Crowdsourcing: high cultural accuracy, but slow and resource-intensive
- Image Crawling: ~85% cultural relevance and highly scalable
- Image Generation: still fails to reflect SEA cultures authentically and poses licensing challenges
 
For in-depth information on each approach, check out our paper.
What's Next?
We extend our deepest thanks to the contributors across Southeast Asia who made this possible.
This is only the beginning: Phase 2 is on the horizon, and we invite researchers, practitioners, and community members to collaborate with us. Stay tuned on our Discord!
Together, let's build AI that reflects the full spectrum of culture across Southeast Asia.