SEACrowd Apprentice Program is a remote research program (Feb 1 - Jun 30, 2026) that pairs experienced researchers with early-career researchers across Southeast Asia to build models, datasets, and publishable research. Small, mentored teams work on scoped projects with structured milestones and community support, creating a clear path toward PhD admissions, jobs, and stronger SEA regional capacity.
Each mentee will join a team of 2–3 mentors (at least 1 primary mentor and 1 secondary mentor) and 2+ fellow mentees to execute a research project over four months. The program emphasizes hands-on experience, mentorship, and peer learning, culminating in a research report that may develop into a paper submission to top AI/ML/NLP conferences, as arranged by the team.
Mid-term milestone: End of Mar / early Apr (internal review)
End-term milestone: End of Jun (SEACrowd-wide + external committee)
Publications are encouraged when ready, not tied to specific conference deadlines.
Who You’ll Work With
Primary Mentors
Alham Fikri Aji
MBZUAI
Assistant Professor
Samuel Cahyawijaya
Cohere
Member of Technical Staff
Ekapol Chuangsuwanich
Chulalongkorn University
Fajri Koto
MBZUAI
Assistant Professor
Peerat Limkonchotiwat
AI Singapore
Research Fellow
Genta Indra Winata
Capital One AI Foundation
Senior Applied Scientist
Secondary Mentors
Farid Adilazuarda
University of Edinburgh
PhD Student
M. Dehan Al-Kautsar
MBZUAI
NLP Researcher
David Anugraha
Stanford University
MSc Student
Patomporn Payoungkhamdee
VISTEC
PhD Student
M. Reza Qorib
NUS
Research Fellow
Pume Tuchinda
VISTEC
Research Assistant
Organizers & Research Managers
Aye Hninn Khine
KMUTT, SEACrowd
Holy Lovenia
SEACrowd
FAQs
Applications: Nov 17 – Dec 17, 2025 (23:59 UTC-12)
Selection: mid-Dec 2025 – Jan 19, 2026 (screening + interview)
Team announcement: mid-Jan 2026
Program dates: Feb 1 – Jun 30, 2026
We list the official milestones above. Since the program is remote and part-time, we expect conflicts to be manageable. Any major conflicts should be disclosed in your application, or smoothed out with your team if you’re selected.
There are no formal eligibility or age limits. You can apply if you meet at least one of the following:
Your affiliation (school, organization, or company) is from Southeast Asia (SEA)
You were born and raised in SEA (i.e., you have lived there for more than 10 years)
You are doing or have done SEA-related research
You can still qualify if your work connects to Southeast Asia. One clear way is to do research related to Southeast Asia, particularly in Machine Learning or Natural Language Processing.
Examples include work on SEA languages, regional datasets, or SEA-specific social or cultural AI challenges.
This alone does not qualify you for the program.
However, speaking a language from Southeast Asia can help, especially if it informs your research interests. We encourage you to highlight any relevant language skills and how they connect to your research goals in your application.
No, applications are individual. However, during the selection process, we may consider team dynamics and try to group mentees with complementary skills and interests.
Prior research experience or publications are not required but can strengthen your application. We value strong foundational knowledge in ML, potential, motivation, and fit with project topics and mentors more than credentials. If you have relevant experience, be sure to highlight it in your application.
We expect mentees to commit at least 10 hours per week (hard minimum). This includes time for meetings, research, coding, writing, and collaboration with mentors and peers. The recommended commitment is 15–20 hours per week for a more immersive experience.
We understand that mentees may have other commitments. The program is designed to be flexible and part-time. We recommend discussing your availability and commitments with your mentors and team members to ensure a manageable workload.
However, we cannot accommodate mentees who cannot commit the minimum required time.
The SEACrowd Apprentice Program is a remote research program.
Yes, the program is completely free to join for all selected mentees. We provide compute necessary for research projects.
The program is unpaid. However, mentees gain valuable research experience, mentorship, networking opportunities, and potential publication avenues.
Unfortunately, we do not have funds to provide stipends or financial support at this time. We recommend seeking external funding sources, scholarships, or institutional support if needed. Otherwise, we welcome you to apply in future cohorts when you have the necessary resources or when we may have funding available.
We use English as the primary language for all program communications, documentation, and deliverables. Communication channels will be team dependent, but we expect teams to primarily use Discord for day-to-day communication, and Google Meet / Zoom / Microsoft Teams for meetings. Meeting frequency will be up to each team to decide, but expect weekly or bi-weekly check-ins with mentors.
No, mentees can only be accepted to work on one project per cohort to ensure focus and commitment.
Once accepted, mentees are expected to commit to their assigned project for the duration of the program. Switching projects is generally not allowed, as it can disrupt team dynamics and project continuity. If you have significant concerns, please discuss them with the program organizers.
For the 2026 cohort, we are only accepting applications for the predefined research topics listed on the program page. You can suggest ideas for the topic that you’ve chosen, but you cannot propose an entirely new project. The topics typically come from our mentors and organizers based on research gaps in Southeast Asia and their expertise, so we can ensure quality mentorship and project scope.
However, we encourage you to suggest new project ideas for future cohorts by reaching out to us via email at seacrowd.research@gmail.com. You can also reach out to other mentors & collaborators in research communities like Cohere Labs Open Science Community, Eleuther AI, Masakhane, or Nous Research.
While we encourage and support publication efforts, we cannot guarantee that every project will result in a publication. Successful publication depends on various factors, including the quality of the research, relevance to conference/journal themes, and acceptance by peer reviewers.
Yes, all mentees who complete the program will receive a Certificate of Achievement.
Strong contributors may receive letters of recommendation from their mentors upon successful completion of the program. This is typically reserved for mentees who demonstrate significant effort, growth, and contribution to their projects.
We expect you to retain your commitment to the project, but we understand that unforeseen circumstances may arise. If you need to leave the program early, please inform your mentors and the program organizers as soon as possible. We encourage open communication to manage expectations and ensure a smooth transition for your team.
No. The author list is agreed upon at the beginning of the project. Anyone who does not contribute to the paper will not be included in the author list.
If you’re not selected for this cohort, we encourage you to apply again in future cohorts. We also recommend joining our Google Group and following us on X/Twitter, Facebook, or LinkedIn to stay updated on future opportunities.
Multilingual Agentic for Underrepresented Regions
Build an environment and evaluation benchmark for agentic LLMs in low-resource languages and underrepresented regions.
Project proposal
In this work, we address the gap in enabling LLMs with agentic capabilities for low-resource languages and underrepresented regions. Most existing environments and evaluation benchmarks (e.g., τ-bench) are Anglocentric, leaving a critical void in assessing performance across diverse linguistic contexts.
This initiative aims to develop a comprehensive environment and evaluation benchmark that measures agentic LLM effectiveness in underrepresented regions, ensuring these technologies are inclusive and accessible to a broader global audience.
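To make the proposal concrete, here is a minimal sketch of a τ-bench-style evaluation episode: an agent converses with a simulated user and calls domain tools, and success is judged by comparing the final environment state against a gold annotation. All names (Task, run_episode, the action schema) are hypothetical illustrations, not the project's actual design.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Task:
    instruction: str          # what the simulated user wants (in the target language)
    language: str             # e.g. "ind", "tha", "vie"
    gold_db_state: dict       # expected environment state after a correct episode
    max_turns: int = 20

@dataclass
class Episode:
    db: dict                  # mutable environment state the tools act on
    transcript: list = field(default_factory=list)

def run_episode(task: Task, agent_step: Callable, user_step: Callable,
                tools: dict) -> bool:
    ep = Episode(db={})
    user_msg = task.instruction
    for _ in range(task.max_turns):
        action = agent_step(ep.transcript, user_msg, list(tools))
        ep.transcript.append(("user", user_msg))
        ep.transcript.append(("agent", action))
        if action["type"] == "tool_call":
            # Tools mutate the environment state and report back to the agent.
            result = tools[action["name"]](ep.db, **action["args"])
            user_msg = f"[tool result] {result}"
        elif action["type"] == "respond":
            user_msg = user_step(ep.transcript, action["text"])
            if user_msg is None:      # user simulator ends the dialogue
                break
    # Success = final environment state matches the gold annotation.
    return ep.db == task.gold_db_state
```

A multilingual benchmark would instantiate the same tasks across languages and compare per-language success rates under this shared harness.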
Relevant publications:
τ-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
τ2-Bench: Evaluating Conversational Agents in a Dual-Control Environment
CoRaL: Contextual Relevance and Linguistic Enrichment
A multi-dimensional data curation framework to balance quality, relevance, and cultural coverage in low-resource corpora.
Project proposal
Low-resource language corpora often suffer from noise, domain imbalance, and linguistic mixing, making naive filtering harmful to both quantity and cultural representation.
We present CoRaL, a multi-dimensional framework for selective and context-aware data curation. CoRaL evaluates each text sample across linguistic purity, topical relevance, and cross-lingual interference dimensions, then applies a tiered strategy: retain high-quality data, repair mid-quality samples via self-denoising, contextual regeneration, and human-in-the-loop correction, and discard irreparably noisy text.
Additionally, CoRaL expands coverage by integrating high-resource language data that is culturally or thematically aligned with the target community.
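The tiered strategy can be sketched in a few lines. This is an illustrative simplification, not the project's implementation: the scoring functions and thresholds are placeholders, and the score combination (interference entering negatively) is an assumption.

```python
from typing import Callable

def curate(samples: list[str],
           purity: Callable[[str], float],        # linguistic purity in [0, 1]
           relevance: Callable[[str], float],     # topical relevance in [0, 1]
           interference: Callable[[str], float],  # cross-lingual interference in [0, 1]
           keep_thresh: float = 0.8,
           repair_thresh: float = 0.5):
    """Score each sample along three dimensions, then route it to a tier."""
    retained, to_repair, discarded = [], [], []
    for s in samples:
        # Lower interference is better, so it enters the score inverted.
        score = (purity(s) + relevance(s) + (1.0 - interference(s))) / 3.0
        if score >= keep_thresh:
            retained.append(s)      # tier 1: high quality, keep as-is
        elif score >= repair_thresh:
            to_repair.append(s)     # tier 2: candidate for self-denoising,
                                    # regeneration, or human correction
        else:
            discarded.append(s)     # tier 3: irreparably noisy
    return retained, to_repair, discarded
```

In practice the three scorers would be learned classifiers or LM-based raters rather than heuristics.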
Relevant publications:
Meta-rater: A Multi-dimensional Data Selection Method for Pre-training Language Models
Reasoning Agentic LLM Router
Develop skill-based routing to reduce inference costs while preserving strong generalization.
Project proposal
Learning to route effectively is crucial for improving the efficiency of LLM inference by leveraging model capabilities. Prior work explores routing strategies, but does not thoroughly examine fine-grained, skill-based routing that can substantially reduce costs while preserving strong generalization.
In this project, we aim to develop and evaluate methods for training routers across diverse tasks and settings. We will investigate reward-based and reasoning-driven approaches, as well as sampling (test-time scaling) techniques, to train routers that make routing decisions grounded in reasoning.
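As a toy illustration of skill-based routing (not the methods the project will develop), a router can dispatch each query to the cheapest model whose estimated competence on the required skill clears a threshold. The model names, costs, and skill scores below are all made up.

```python
def route(query_skill: str,
          models: dict[str, dict],    # name -> {"cost": float, "skills": {skill: score}}
          min_competence: float = 0.7) -> str:
    """Return the cheapest model competent at the required skill."""
    candidates = [
        (spec["cost"], name)
        for name, spec in models.items()
        if spec["skills"].get(query_skill, 0.0) >= min_competence
    ]
    if not candidates:
        # Unknown skill: fall back to the overall strongest model.
        return max(models, key=lambda n: sum(models[n]["skills"].values()))
    return min(candidates)[1]

MODELS = {
    "small-llm": {"cost": 1.0,  "skills": {"translation": 0.80, "math": 0.40}},
    "large-llm": {"cost": 10.0, "skills": {"translation": 0.95, "math": 0.90}},
}
```

The research question is then how to learn the skill predictor and competence estimates, e.g. from rewards or reasoning traces, instead of hand-coding them as above.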
Relevant publications:
RouteLLM: Learning to Route LLMs with Preference Data
AgentRouter: A Knowledge-Graph-Guided LLM Router for Collaborative Multi-Agent Question Answering
MixLLM: Dynamic Routing in Mixed Large Language Models
Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning
An Architecture Search Framework for Inference-Time Techniques
Selective Memory Layer Finetuning
Explore memory-layer finetuning strategies to improve continual learning without catastrophic forgetting.
Project proposal
We tackle continual learning from an architectural perspective. Instead of LoRA, whose parameters grow with the number of tasks or languages, we explore memory layers where the model can store or learn context by injecting key-value information during inference.
A major issue is brittleness against gradient collapse or catastrophic forgetting during key-value injection. Recent work explores sparse finetuning for memory layers, but it is still unclear which memory-layer components should be tuned to mitigate forgetting.
We will test this empirically by systematically finetuning different components to improve stability and retention. The goal is to make memory-based continual learning both efficient and robust.
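The "which components to tune" question reduces to selecting parameter subsets by name. A minimal sketch, with hypothetical parameter names: freeze everything, then unfreeze only the chosen memory-layer components (here, values but not keys).

```python
def select_trainable(param_names: list[str],
                     tune_components: tuple[str, ...] = ("memory.values",)) -> dict[str, bool]:
    """Map each parameter name to whether it should receive gradients."""
    return {
        name: any(component in name for component in tune_components)
        for name in param_names
    }

# Hypothetical parameter names for a model with interleaved memory layers.
params = [
    "layers.0.attn.q_proj",
    "layers.0.memory.keys",
    "layers.0.memory.values",
    "layers.1.memory.values",
]
trainable = select_trainable(params, tune_components=("memory.values",))
```

With a real PyTorch model, the same mask would be applied by iterating over `model.named_parameters()` and calling `p.requires_grad_(flag)`; the empirical study is then which component sets best trade off plasticity against forgetting.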
Relevant publications:
Empirical Study on Updating Key-Value Memories in Transformer Feed-Forward Layers
Memory Layers at Scale
Continual Learning via Sparse Memory Finetuning
Knowledge Distillation in Multilingual Vision-Text Model
Distill compact multilingual vision-text embeddings from large multimodal teachers for real-world deployment.
Project proposal
We propose a training framework to distill a small vision-text embedding model from a large multimodal teacher. Existing KD approaches often assume a base-sized teacher and focus on monolingual settings, leaving large teachers and multilingual scenarios underexplored.
This project will design a KD framework for large-scale teacher models and multilingual vision-text models. The resulting model should be compact and efficient for real-world scenarios and edge devices.
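One common embedding-distillation objective, shown here only as an assumed starting point rather than the project's chosen loss, is to project student embeddings into the teacher's space and minimize one minus their cosine similarity.

```python
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def project(x: list[float], W: list[list[float]]) -> list[float]:
    """Linear projection mapping the student dimension to the teacher's."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def distill_loss(student_emb: list[list[float]],
                 teacher_emb: list[list[float]],
                 W: list[list[float]]) -> float:
    """Mean (1 - cosine similarity) between projected student and teacher."""
    losses = [1.0 - cosine(project(s, W), t)
              for s, t in zip(student_emb, teacher_emb)]
    return sum(losses) / len(losses)
```

The projection W is needed precisely because large teachers and compact students rarely share an embedding dimension; in practice both the projection and the student are trained jointly on multilingual image-text pairs.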
Relevant publications:
MIEB: Massive Image Embedding Benchmark
Publications
Projects
Language Surgery in Multilingual Large Language Models
Mentors: Samuel Cahyawijaya, Genta Indra Winata, Fajri Koto, Peerat Limkonchotiwat, Alham Fikri Aji
A technique for controlling language use in multilingual LLMs without retraining.
Mentees: Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong
Multilingual LLMs can drift into the wrong language or mix languages, especially when prompted in under-resourced languages.
We wanted to understand:
How multilingual LLMs organize languages internally
Whether we can steer which language the model generates without retraining
The approach
The team studied how multilingual LLMs organize languages in their latent space and how representations shift across languages.
They developed Inference-Time Language Control (ITLC), a method for nudging models toward more consistent language outputs without retraining.
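Conceptually, this kind of latent steering can be pictured as shifting a hidden state along a language direction. The sketch below is a hypothetical simplification of ITLC, not its actual formulation: it estimates a mean latent vector per language and moves a hidden state from the source language's direction toward the target's.

```python
def mean_vector(hidden_states: list[list[float]]) -> list[float]:
    """Estimate a language's latent vector as the mean of its hidden states."""
    n = len(hidden_states)
    return [sum(h[i] for h in hidden_states) / n
            for i in range(len(hidden_states[0]))]

def steer(h: list[float], src_vec: list[float], tgt_vec: list[float],
          alpha: float = 1.0) -> list[float]:
    """Shift hidden state h along the (target - source) language direction."""
    return [h_i + alpha * (t_i - s_i)
            for h_i, s_i, t_i in zip(h, src_vec, tgt_vec)]
```

In a real model the injection happens at the middle layers, where the paper finds language-specific and language-agnostic information are most cleanly separable.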
Why this matters
Low-resource languages often see less stable outputs, more language switching, and lower quality.
ITLC offers a practical way to improve behavior in underrepresented languages while helping us understand how languages are arranged inside these models.
Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
Mentors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
Learning language embeddings from model entropy to recover typological structure.
Mentees: Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady
Linguistic typology is traditionally built from manual documentation and expert analysis.
LLMs, however, learn language behavior at scale. We asked whether internal language knowledge in LLMs can reflect typological relationships similar to those in resources like Ethnologue and Glottolog.
The approach
The team extracted language representations from cross-lingual language modeling entropy across monolingual language models.
They built Entropy2Vec, a language embedding space where distances between vectors align with typological structure, using model prediction patterns rather than handcrafted linguistic features.
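A toy version of the idea, using unigram frequency models as stand-ins for the monolingual language models (a deliberate simplification of the actual method): score each language's text under several probe models, and use the resulting entropy profile as that language's embedding.

```python
import math

def unigram_model(corpus: str) -> dict[str, float]:
    """A trivial 'monolingual LM': token relative frequencies."""
    tokens = corpus.split()
    total = len(tokens)
    return {t: tokens.count(t) / total for t in set(tokens)}

def cross_entropy(model: dict[str, float], text: str,
                  floor: float = 1e-6) -> float:
    """Average negative log-probability of `text` under `model`.
    Low values suggest structural similarity to the model's language."""
    tokens = text.split()
    return -sum(math.log(model.get(t, floor)) for t in tokens) / len(tokens)

def entropy2vec(text: str, probe_models: list[dict[str, float]]) -> list[float]:
    """A language's embedding = its entropy profile across probe models."""
    return [cross_entropy(m, text) for m in probe_models]
```

Distances between such embeddings can then be compared against typological groupings; the paper's version uses neural language models and learns the representations end to end.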
Why this matters
The learned embeddings can recover typological structure and help regularize fine-tuning to improve transfer to unseen languages.
For SEA languages, this opens opportunities to reuse existing models to infer structure and improve generalization with limited data.
SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages
Models often perform worse in underrepresented languages and fail to align with local norms and expectations in multi-turn conversations.
The approach
The team built a culturally grounded, multi-turn dialogue dataset for SEA languages. They collected, curated, and annotated conversations that reflect local cultural practices.
The dataset supports both training and evaluation of value-aware conversational models.
Why this matters
SEADialogues provides a targeted resource for improving conversational AI in SEA languages, a benchmark for evaluating value alignment, and a starting point for future work on dialogue safety and social norms in SEA contexts.
Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
• 2025
Abstract
Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.
Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025)
• 2025
Abstract
We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.
Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata
Preprint
• 2025
Abstract
Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.