TL;DR: We’re at the midpoint of our SEACrowd Apprentice Program 2026 cohort, and the energy is electric! Five mentored teams with a total of 21 mentees from Southeast Asia have made remarkable progress across cutting-edge research projects focused on language models and AI. This post shares our mid-term achievements, introduces the 2026 project teams, and opens the door for mentors to join our 2027 batch.
Interested in mentoring the 2027 SEACrowd Apprentice Program? Fill out the interest form here.
Background
As we reach the mid-term milestone of our SEACrowd Apprentice Program 2026 in mid-April 2026, we’re thrilled to share the incredible progress our teams have made. The program pairs experienced researchers with early-career talent across Southeast Asia to build models, datasets, and publishable research, and this cohort is delivering on that promise in spectacular fashion.
Mid-Term Momentum: Five Teams, Five Success Stories
Our five mentored teams have hit the ground running with structured milestones and community support. Here’s what they’ve accomplished so far:
Team 1: Multilingual Agentic for Underrepresented Regions
Mentors: Samuel Cahyawijaya, Patomporn Payoungkhamdee
The Problem: Recent agent-based LLM evaluation frameworks focus on capabilities in English. This undermines regional AI sovereignty by prioritizing externally developed assessment paradigms over locally relevant metrics and governance frameworks across diverse linguistic and cultural contexts.
The Approach: The team is developing a comprehensive evaluation framework that tests agentic capabilities and uncovers the multifaceted gaps that appear when moving from an English-only to a more localized evaluation framework, with a case study on Southeast Asian regional adaptation.
Why This Matters: This work underscores that multilinguality remains largely unsolved in the current era of agentic AI, while pioneering a crucial blueprint for sovereign agentic AI assessment and inspiring future research to develop more rigorous methods and assessment frameworks for AI sovereignty that align with recent technological advancements.
Mid-Term Progress: The team has completed the initial development of the evaluation framework and is now running pilot tests across several languages spoken in Southeast Asia, including Chinese, Indonesian, Thai, and Vietnamese. Early results show a clear robustness gap when adapting from English to the linguistic nuances of other languages, with the framework successfully identifying language-specific agentic failure modes that weren’t captured in existing English-only assessments.
Team 2: CoRaL: Contextual Relevance and Linguistic Enrichment for Balanced Data Curation in Low-Resource Language Pre-Training
Mentors: Fajri Koto, Muhammad Dehan Al Kautsar
The Problem: Low-resource language corpora are typically plagued by noise, domain imbalance, and code-mixing. Naive filtering pipelines often make things worse — discarding scarce data and erasing the cultural signal that makes these corpora valuable in the first place. The result: training data that is neither high quality nor culturally representative.
The Approach: The team is building CoRaL, a context-aware data curation framework that transforms noisy raw text into high-quality, culturally grounded training data. Beyond filtering, CoRaL also explores dictionary-driven augmentation to expand corpora for languages where naturally occurring text is scarce.
Why This Matters: Southeast Asia is home to hundreds of millions of speakers across dozens of languages, yet most remain critically underrepresented in modern LLMs. CoRaL aims to close this gap by curating and expanding pre-training data for these languages, with a focus on preserving linguistic and cultural fidelity rather than optimizing for raw token counts.
Mid-Term Progress: The team has defined seven quality dimensions and built an evaluation pipeline to identify the most reliable LLM-as-a-judge, measured by correlation with human annotations. In parallel, they have curated dictionaries for several Southeast Asian languages and are now investigating how to leverage these resources to prompt LLMs into producing fluent, contextually appropriate sentences — even for languages the models have never seen during pre-training.
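To make the judge-selection step concrete, here is a minimal sketch of how candidate LLM judges could be ranked by agreement with human annotations. It assumes Spearman rank correlation as the agreement measure, which the team's pipeline may or may not use, and the judge names and scores are hypothetical illustration data, not project results.

```python
from typing import Dict, List


def ranks(values: List[float]) -> List[float]:
    """Return 1-based average ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x: List[float], y: List[float]) -> float:
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def spearman(x: List[float], y: List[float]) -> float:
    """Spearman correlation is Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


def most_reliable_judge(human: List[float], judges: Dict[str, List[float]]) -> str:
    """Pick the judge whose scores correlate best with the human scores."""
    return max(judges, key=lambda name: spearman(human, judges[name]))


# Hypothetical quality scores for five documents (assumption, not real data).
HUMAN_SCORES = [1, 2, 3, 4, 5]
JUDGE_SCORES = {
    "judge_a": [1, 2, 3, 5, 4],  # mostly agrees with the human ordering
    "judge_b": [3, 3, 3, 3, 3],  # gives every document the same score
    "judge_c": [5, 4, 3, 2, 1],  # reverses the human ordering
}

if __name__ == "__main__":
    print(most_reliable_judge(HUMAN_SCORES, JUDGE_SCORES))
```

In practice the comparison would likely be run once per quality dimension, since a judge that tracks humans well on one dimension may not do so on another.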
Team 3: Reasoning Agentic LLM Router
Mentors: Genta Indra Winata, David Anugraha
The Problem: Existing routing strategies don’t thoroughly examine fine-grained, skill-based routing that can substantially reduce inference costs while preserving strong generalization.
The Approach: The team is developing skill-based routing mechanisms using agentic LLM-driven approaches that leverage curated descriptions of model strengths and capabilities. They are also investigating sampling techniques for training routers that can make reasoning-grounded decisions more effectively.
Why This Matters: This work aims to make LLM inference more efficient and cost-effective, which is particularly important for resource-constrained environments in Southeast Asia.
Mid-Term Progress: The team has implemented initial routing prototypes and conducted comparative analysis across diverse tasks. They’re seeing promising results in cost reduction while maintaining performance benchmarks, with some configurations achieving 35% inference cost savings without significant accuracy drops.
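As a rough illustration of the idea behind skill-based routing, the sketch below sends each query to the cheapest model whose description lists the required skill. It is a toy under stated assumptions: the model cards, skill labels, and costs are hypothetical, and a plain lookup stands in for the agentic, reasoning-grounded router the team is building. The saving it computes comes from the toy numbers only and is unrelated to the team's reported results.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class ModelCard:
    """Hypothetical curated description of one model's strengths and cost."""
    name: str
    skills: FrozenSet[str]
    cost_per_query: float


def route(required_skill: str, models: List[ModelCard]) -> ModelCard:
    """Pick the cheapest model listing the skill, else the most expensive one."""
    capable = [m for m in models if required_skill in m.skills]
    if capable:
        return min(capable, key=lambda m: m.cost_per_query)
    # No card claims the skill, so fall back to the most expensive model.
    return max(models, key=lambda m: m.cost_per_query)


def cost_saving(required_skills: List[str], models: List[ModelCard]) -> float:
    """Fraction of cost saved versus sending every query to the priciest model."""
    baseline = max(m.cost_per_query for m in models) * len(required_skills)
    routed = sum(route(s, models).cost_per_query for s in required_skills)
    return 1 - routed / baseline


# Hypothetical model pool and query mix (assumptions for illustration).
MODELS = [
    ModelCard("small", frozenset({"translation"}), 1.0),
    ModelCard("medium", frozenset({"translation", "summarization"}), 4.0),
    ModelCard("large", frozenset({"translation", "summarization", "math"}), 10.0),
]
QUERY_SKILLS = ["translation", "summarization", "math", "math"]

if __name__ == "__main__":
    for skill in QUERY_SKILLS:
        print(skill, "->", route(skill, MODELS).name)
    print(f"saving: {cost_saving(QUERY_SKILLS, MODELS):.1%}")
```

The hard part, which this sketch skips, is deciding which skill a query actually needs and whether a cheaper model truly handles it, and that is where a learned or LLM-driven router comes in.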
Team 4: Selective Memory Layer Finetuning
Mentors: Alham Fikri Aji, Farid Adilazuarda, Muhammad Reza Qorib
The Problem: Performing long-context inference with large language models (LLMs) is costly because the memory and computation required for the KV cache scale with the input length. Recently, a compressed memory caching method called Cartridge was proposed to reduce KV-cache memory usage while maintaining long-context capabilities. However, challenges remain in adapting these compressed memories to new knowledge and composing information across contexts.
The Approach: The team is investigating ideas for building a general-purpose memory system for lifelong models that can manage its own context, continually learn, and support modular composition.
Why This Matters: This research could enable more efficient and robust continual learning and long-context inference for LLMs.
Mid-Term Progress: The team has empirically investigated limitations in Cartridge’s adaptability to new knowledge and its composability across contexts.
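The KV-cache cost mentioned in the problem statement can be estimated with simple arithmetic. The sketch below assumes a standard transformer decoder that stores one key and one value vector per token, per layer, and per KV head, with no compression. The model configuration is hypothetical and is not tied to any model used by the team.

```python
def kv_cache_bytes(
    seq_len: int,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes per value for fp16/bf16
    batch_size: int = 1,
) -> int:
    """Estimate uncompressed KV-cache size in bytes for a decoder-only model."""
    # The leading 2 accounts for storing both keys and values in every layer.
    return 2 * batch_size * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


# Hypothetical configuration (assumption): 32 layers, 8 KV heads of size 128.
CONFIG = dict(n_layers=32, n_kv_heads=8, head_dim=128)

if __name__ == "__main__":
    for seq_len in (8_192, 32_768, 131_072):
        gib = kv_cache_bytes(seq_len, **CONFIG) / 2**30
        print(f"{seq_len:>7} tokens -> {gib:.0f} GiB")
```

Because the size grows linearly with the input length, a 16-fold longer context needs 16 times the cache memory under these assumptions, which is the cost that compressed memories such as Cartridge aim to reduce.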
Team 5: Knowledge Distillation in Multilingual Vision-Text Models
Mentors: Peerat Limkonchotiwat, Pume Tuchinda, Ekapol Chuangsuwanich
The Problem: Existing knowledge distillation approaches focus on monolingual settings and assume base-sized teachers, leaving large teachers and multilingual scenarios underexplored.
The Approach: The team is designing a knowledge distillation (KD) framework specifically for large-scale teacher models and multilingual vision-text scenarios. They’re creating compact embeddings suitable for edge devices.
Why This Matters: The resulting models will be efficient enough for real-world deployment in Southeast Asian contexts while maintaining multilingual capabilities.
Mid-Term Progress: The team successfully reproduced the CLIP-KD system for multilingual applications. By pairing the existing system with a multilingual tokenizer, they confirmed that the new configuration supports small-scale multilingual vision-text models. The next step is to train the model on large-scale multilingual datasets.
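For readers new to distillation of embedding models, the sketch below shows one common objective, feature mimicry, in which a small student's embeddings are projected into the teacher's embedding space and pulled toward the teacher's embeddings with a mean squared error. This is an illustrative assumption rather than a description of the team's training setup. The vectors and projection weights are hypothetical, and a real implementation would use a tensor library and learn the projection jointly with the student.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(weights: Matrix, embedding: Vector) -> Vector:
    """Map a student embedding into the teacher's space (one row per output dim)."""
    return [sum(w * x for w, x in zip(row, embedding)) for row in weights]


def feature_mimicry_loss(
    student_batch: List[Vector],
    teacher_batch: List[Vector],
    weights: Matrix,
) -> float:
    """Mean squared error between projected student and teacher embeddings."""
    total = 0.0
    count = 0
    for student, teacher in zip(student_batch, teacher_batch):
        projected = project(weights, student)
        total += sum((p - t) ** 2 for p, t in zip(projected, teacher))
        count += len(teacher)
    return total / count


# Hypothetical toy data: 2-dimensional student, 3-dimensional teacher.
WEIGHTS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
STUDENT = [[1.0, 2.0], [0.0, 1.0]]
TEACHER = [[1.0, 2.0, 3.0], [0.0, 1.0, 3.0]]

if __name__ == "__main__":
    print(feature_mimicry_loss(STUDENT, TEACHER, WEIGHTS))
```

The projection is what lets a compact student with a smaller embedding size learn from a much larger teacher, which matters when the target is an edge device.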
Why Mentor with SEACrowd in 2027?
If you’ve been inspired by the progress of our 2026 cohort, we invite you to join us as a mentor for the 2027 SEACrowd Apprentice Program! Here’s why you should consider getting involved:
- Impact Southeast Asia’s AI Future: Directly shape the next generation of AI researchers in our region
- Work with Exceptional Talent: Mentor bright early-career researchers who are hungry to learn and contribute
- Collaborative Community: Join a network of leading researchers and practitioners across Southeast Asia and beyond
- Meaningful Research: Contribute to projects that address real challenges faced by Southeast Asian communities
- Professional Growth: Develop leadership and teaching skills while staying at the forefront of AI research
How to Get Involved
We’re currently looking for quality mentors for the 2027 SEACrowd Apprentice Program. If you’re an experienced researcher or practitioner in AI/ML/NLP with a passion for mentoring and Southeast Asian development, we want to hear from you!
Program Details: The 2027 program will run from February to June 2027, following the same successful structure as our 2026 cohort with remote collaboration, structured milestones, and community support.
Final Thought
The mid-term progress of our SEACrowd Apprentice Program 2026 cohort demonstrates the incredible potential of Southeast Asian AI talent when given the right opportunities and support. From developing culturally aware evaluation frameworks to creating efficient multilingual models, our teams are tackling challenges that matter to our region.
This program represents more than just research; it’s about building a sustainable AI ecosystem in Southeast Asia, one talented researcher at a time. The journey from “I’m curious about AI research” to “I’ve actually done it” is challenging but incredibly rewarding, as we’ve seen with our 2024-25 cohort, who published papers at the 5th Workshop on Multilingual Representation Learning (MRL) at EMNLP 2025.
We’re making structural improvements for 2027 based on our learnings, but the core mission remains the same: to create meaningful pathways for Southeast Asian researchers to contribute to the global AI conversation.
If you’re inspired by this progress and want to be part of shaping the future of AI in Southeast Asia, whether as a mentor or supporter, we invite you to join us on this journey.
Connect with us:
- Website: https://seacrowd.org
- Discord: https://discord.gg/XXRHFuvkTA
- Email: seacrowd.research@gmail.com
Let’s continue building AI for Southeast Asia, by Southeast Asians!