SEA-VL Phase 1

Paper: ACL 2025, arXiv
Hugging Face Collection: Link

We're excited to present a major milestone from the SEACrowd team: the launch of SEA-VL, the largest open-source vision-language (VL) dataset specifically designed to represent the cultural diversity of Southeast Asia 🇧🇳🇰🇭🇹🇱🇮🇩🇱🇦🇲🇾🇲🇲🇵🇭🇸🇬🇹🇭🇻🇳.

Why SEA-VL?

Most vision-language datasets today reflect Western-centric imagery and language, leaving Southeast Asian cultures underrepresented and misinterpreted. SEA-VL is our open-source initiative to change that. It is designed to better represent the languages, traditions, and everyday realities of Southeast Asian communities.

Highlights of the dataset include:

  • 1.3 million culturally relevant image-text pairs
  • Covers all 11 Southeast Asian countries
  • 50× larger than any previous SEA-focused VL dataset
  • Hosted on Hugging Face: Explore SEA-VL

How We Built SEA-VL

We combined several approaches to balance scale with cultural fidelity:

  • Crowdsourcing: high cultural accuracy, but slow and resource-intensive
  • Image Crawling: ~85% cultural relevance and highly scalable
  • Image Generation: still fails to reflect SEA cultures authentically and poses licensing challenges

For in-depth information on each approach, check out our paper.

What's Next?

We extend our deepest thanks to the contributors across Southeast Asia who made this possible.

This is only the beginning: Phase 2 is on the horizon, and we invite researchers, practitioners, and community members to collaborate with us. Stay tuned on our Discord!

Together, let's build AI that reflects the full spectrum of culture across Southeast Asia.