SEACrowd 2026 Apprentice Program

Applications for the 2026 Apprentice Program closed on Dec 17, 2025 at 23:59 (Anywhere on Earth, UTC-12).

If you would like to be notified of future apprentice batches, please join our Google Group or follow us on X/Twitter, Facebook, or LinkedIn.


The SEACrowd Apprentice Program is a remote research program (Feb 1 - Jun 30, 2026) that pairs experienced researchers with early-career researchers across Southeast Asia to build models, datasets, and publishable research. Small, mentored teams work on scoped projects with structured milestones and community support, creating a clear path toward PhD admissions, jobs, and stronger regional capacity in SEA.

Each mentee will join a team of 2–3 mentors (at least 1 primary mentor and 1 secondary mentor) and 2+ fellow mentees to execute a research project over the five-month program. The program emphasizes hands-on experience, mentorship, and peer learning, culminating in a research report that can turn into a paper submission to top AI/ML/NLP conferences, as arranged by the team.

2025 Success Story

Our 2025 cohort featured three apprentice teams who successfully completed their projects, culminating in mentees’ first-authored research papers published at the 5th Workshop on Multilingual Representation Learning (MRL 2025). Check out their publications!

We shared their journey and our learnings from running the first cohort in our Retrospective blog post.

2026 Research Topics

We offer five cutting-edge research projects:

  1. Multilingual Agentic for Underrepresented Regions - Developing evaluation frameworks for agent-based language models in low-resource languages
  2. CoRaL: Contextual Relevance and Linguistic Enrichment - Multi-dimensional data curation framework for low-resource language training corpora
  3. Reasoning Agentic LLM Router - Skill-based routing mechanisms to reduce inference costs while maintaining performance
  4. Selective Memory Layer Finetuning - Architectural solutions for continual learning using memory layers
  5. Knowledge Distillation in Multilingual Vision-Text Models - Creating compact vision-language embeddings for edge devices

View detailed project topics →

Who Can Apply

There are no formal eligibility or age limits. We’re a growth-first program and value potential, motivation, and effort more than credentials.

We welcome anyone who shares our vision and meets at least one of the following:

  • Your affiliation (school, organization, or company) is from Southeast Asia (SEA)
  • You were born and raised in SEA (lived there for more than 10 years)
  • You are doing or have done SEA-related research

What Increases Your Chances

What You’ll Gain

Check out previous batch publications!

Application Process

Apply Here

Program Schedule

Publications are encouraged when ready, not tied to specific conference deadlines.

Who You’ll Work With

Primary Mentors

Alham Fikri Aji
MBZUAI
Assistant Professor
Samuel Cahyawijaya
Cohere
Member of Technical Staff
Ekapol Chuangsuwanich
Chulalongkorn University
Fajri Koto
MBZUAI
Assistant Professor
Peerat Limkonchotiwat
AI Singapore
Research Fellow
Genta Indra Winata
Capital One AI Foundation
Senior Applied Scientist

Secondary Mentors

Farid Adilazuarda
University of Edinburgh
PhD Student
M. Dehan Al-Kautsar
MBZUAI
NLP Researcher
David Anugraha
Stanford University
MSc Student
Patomporn Payoungkhamdee
VISTEC
PhD Student
M. Reza Qorib
NUS
Research Fellow
Pume Tuchinda
VISTEC
Research Assistant

Organizers & Research Managers

Aye Hninn Khine
KMUTT, SEACrowd
Holy Lovenia
SEACrowd

FAQs

  • Applications: Nov 17 – Dec 17, 2025 (23:59 UTC-12)
  • Selection: mid-Dec 2025 – Jan 19, 2026 (screening + interview)
  • Team announcement: mid-Jan 2026
  • Program dates: Feb 1 – Jun 30, 2026

We list the official milestones above. Since the program is remote and part-time, we expect conflicts to be manageable. Please disclose any major conflicts in your application, or work them out with your team if you’re selected.

There are no formal eligibility or age limits. You can apply if you meet at least one of the following:

  • Your affiliation (school, organization, or company) is from Southeast Asia (SEA)
  • You were born and raised in SEA (lived there for more than 10 years)
  • You are doing or have done SEA-related research

You can still qualify if your work connects to Southeast Asia; the clearest path is research related to the region, particularly in Machine Learning or Natural Language Processing.

Examples include work on SEA languages, regional datasets, or SEA-specific social or cultural AI challenges.

Speaking a language from Southeast Asia does not by itself qualify you for the program. However, it can help, especially if it informs your research interests. We encourage you to highlight any relevant language skills and how they connect to your research goals in your application.

No, applications are individual. However, during the selection process, we may consider team dynamics and try to group mentees with complementary skills and interests.

Prior research experience or publications are not required but can strengthen your application. We value strong foundational knowledge in ML, potential, motivation, and fit with project topics and mentors more than credentials. If you have relevant experience, be sure to highlight it in your application.

We expect mentees to commit at least 10 hours per week (hard minimum). This includes time for meetings, research, coding, writing, and collaboration with mentors and peers. The recommended commitment is 15–20 hours per week for a more immersive experience.

We understand that mentees may have other commitments. The program is designed to be flexible and part-time. We recommend discussing your availability and commitments with your mentors and team members to ensure a manageable workload.

However, we cannot accommodate mentees who cannot commit the minimum required time.

The SEACrowd Apprentice Program is a remote research program.

Yes, the program is completely free to join for all selected mentees. We provide compute necessary for research projects.

The program is unpaid. However, mentees gain valuable research experience, mentorship, networking opportunities, and potential publication avenues.

Unfortunately, we do not have funds to provide stipends or financial support at this time. We recommend seeking external funding sources, scholarships, or institutional support if needed. Otherwise, we welcome you to apply in future cohorts when you have the necessary resources or when we may have funding available.

We use English as the primary language for all program communications, documentation, and deliverables. Communication channels are team-dependent, but we expect teams to use Discord for day-to-day communication, and Google Meet / Zoom / Microsoft Teams for meetings. Meeting frequency is up to each team to decide, but expect weekly or bi-weekly check-ins with mentors.

No, mentees can only be accepted to work on one project per cohort to ensure focus and commitment.

Once accepted, mentees are expected to commit to their assigned project for the duration of the program. Switching projects is generally not allowed, as it can disrupt team dynamics and project continuity. If you have significant concerns, please discuss them with the program organizers.

For the 2026 cohort, we are only accepting applications for the predefined research topics listed on the program page. You can suggest ideas for the topic that you’ve chosen, but you cannot propose an entirely new project. The topics typically come from our mentors and organizers based on research gaps in Southeast Asia and their expertise, so we can ensure quality mentorship and project scope.

However, we encourage you to suggest new project ideas for future cohorts by reaching out to us via email at seacrowd.research@gmail.com. You can also reach out to other mentors & collaborators in research communities like Cohere Labs Open Science Community, Eleuther AI, Masakhane, or Nous Research.

While we encourage and support publication efforts, we cannot guarantee that every project will result in a publication. Successful publication depends on various factors, including the quality of the research, relevance to conference/journal themes, and acceptance by peer reviewers.

Yes, all mentees who complete the program will receive a Certificate of Achievement.

Strong contributors may receive letters of recommendation from their mentors upon successful completion of the program. This is typically reserved for mentees who demonstrate significant effort, growth, and contribution to their projects.

We expect you to retain your commitment to the project, but we understand that unforeseen circumstances may arise. If you need to leave the program early, please inform your mentors and the program organizers as soon as possible. We encourage open communication to manage expectations and ensure a smooth transition for your team.

No. The author list is agreed upon at the beginning of the project, and anyone who does not contribute to the paper will not be included.

If you’re not selected for this cohort, we encourage you to apply again in future cohorts. We also recommend joining our Google Group and following us on X/Twitter, Facebook, or LinkedIn to stay updated on future opportunities.

Email us at seacrowd.research@gmail.com or join our Discord and ask in the #apprentice-program channel.

Past Apprentice Research

Projects

Multilingual Agentic for Underrepresented Regions

Build an environment and evaluation benchmark for agentic LLMs in low-resource languages and underrepresented regions.

CoRaL: Contextual Relevance and Linguistic Enrichment

A multi-dimensional data curation framework to balance quality, relevance, and cultural coverage in low-resource corpora.

Reasoning Agentic LLM Router

Develop skill-based routing to reduce inference costs while preserving strong generalization.

Selective Memory Layer Finetuning

Explore memory-layer finetuning strategies to improve continual learning without catastrophic forgetting.

Knowledge Distillation in Multilingual Vision-Text Models

Distill compact multilingual vision-text embeddings from large multimodal teachers for real-world deployment.

Publications

    Projects

    Language Surgery in Multilingual Large Language Models

    Mentors: Samuel Cahyawijaya, Genta Indra Winata, Fajri Koto, Peerat Limkonchotiwat, Alham Fikri Aji

    A technique for controlling language use in multilingual LLMs without retraining.

    Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

    Mentors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya

    Learning language embeddings from model entropy to recover typological structure.

    SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

    Mentors: Genta Indra Winata, Ekapol Chuangsuwanich, Alham Fikri Aji, Fajri Koto, Peerat Limkonchotiwat

    A culturally grounded dialogue dataset and benchmark for SEA languages.

    Publications

    1. Language Surgery in Multilingual Large Language Models
      Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
      Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) • 2025
      Abstract

      Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.

    2. Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
      Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
      Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) • 2025
      Abstract

      We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.

    3. SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages
      Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata
      Preprint • 2025
      Abstract

      Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.

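The Entropy2Vec abstract above describes deriving language embeddings from a monolingual model's predictive entropy on other languages' text. The idea can be illustrated with a minimal sketch; this is not the paper's implementation, and the character-bigram "language models", add-one smoothing, and tiny corpora below are illustrative stand-ins for the real monolingual LMs used in the work.

```python
import math
from collections import Counter, defaultdict

def train_bigram(text):
    """Toy monolingual 'language model': character-bigram counts."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def mean_entropy(counts, text, alphabet):
    """Average predictive entropy (bits) of the model over positions in text.

    Unseen contexts give a uniform (maximum-entropy) distribution via
    add-one smoothing, so unfamiliar text yields higher average entropy.
    """
    entropies = []
    for a in text[:-1]:
        ctx = counts.get(a, Counter())
        total = sum(ctx.values()) + len(alphabet)  # add-one smoothing
        h = -sum(
            ((ctx[c] + 1) / total) * math.log2((ctx[c] + 1) / total)
            for c in alphabet
        )
        entropies.append(h)
    return sum(entropies) / len(entropies)

# Tiny illustrative corpora (hypothetical, not real training data).
corpora = {
    "id": "saya makan nasi goreng setiap pagi",
    "ms": "saya makan nasi lemak setiap pagi",
    "th": "chan kin khao phat thuk chao",
}
alphabet = sorted(set("".join(corpora.values())))
models = {lang: train_bigram(text) for lang, text in corpora.items()}

# A language's embedding: its model's mean entropy on every language's text.
langs = list(corpora)
embeddings = {
    lang: [mean_entropy(models[lang], corpora[tgt], alphabet) for tgt in langs]
    for lang in langs
}
```

In this toy setup the Indonesian model is less uncertain on the closely related Malay text than on the Thai text, mirroring the abstract's hypothesis that low entropy signals structural similarity between languages.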