SEACrowd 2026 Apprentice Program

Apply here by Dec 17, 2025 (UTC-12).

The SEACrowd Apprentice Program is a remote research program (Feb 1 - Jun 30, 2026) that pairs experienced researchers with early-career researchers across Southeast Asia to build models, datasets, and publishable research. Small, mentored teams work on scoped projects with structured milestones and community support, creating a clear path toward PhD admissions, jobs, and stronger regional research capacity in SEA.

Each mentee will join a team of 2–3 mentors (at least one primary mentor and one secondary mentor) and 2+ fellow mentees to execute a research project over four months. The program emphasizes hands-on experience, mentorship, and peer learning, culminating in a research report that may be developed into a paper submission to top AI/ML/NLP conferences, as arranged by the team.

2025 Success Story

Our 2025 cohort featured three apprentice teams who successfully completed their projects, culminating in mentees' first-authored research papers published at the 5th Workshop on Multilingual Representation Learning (MRL 2025). Check out their publications!

We shared their journey and our learnings from running the first cohort in our Retrospective blog post.

2026 Research Topics

We offer five cutting-edge research projects:

  1. Multilingual Agentic for Underrepresented Regions - Developing evaluation frameworks for agent-based language models in low-resource languages
  2. CoRaL: Contextual Relevance and Linguistic Enrichment - Multi-dimensional data curation framework for low-resource language training corpora
  3. Reasoning Agentic LLM Router - Skill-based routing mechanisms to reduce inference costs while maintaining performance (see the toy sketch after this list)
  4. Selective Memory Layer Finetuning - Architectural solutions for continual learning using memory layers
  5. Knowledge Distillation in Multilingual Vision-Text Models - Creating compact vision-language embeddings for edge devices
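
To make topic 3 concrete, the sketch below shows the general shape of skill-based routing: a difficulty scorer decides whether each query goes to a cheap small model or a more capable large one. Everything here is an illustrative assumption (the scorer, the two model wrappers, and the 0.5 threshold), not the project's actual design.

```python
# Toy sketch of skill-based LLM routing (topic 3). The scorer, models,
# and threshold are placeholders, not the project's actual design.

def score_difficulty(query: str) -> float:
    """Stand-in for a learned skill/difficulty classifier.

    A real router would use a trained model; here we simply treat
    longer, reasoning-heavy queries as harder.
    """
    hard_markers = ("prove", "derive", "step by step", "why")
    score = min(len(query.split()) / 100.0, 1.0)
    if any(m in query.lower() for m in hard_markers):
        score += 0.5
    return min(score, 1.0)


def call_small_model(query: str) -> str:
    # Hypothetical wrapper around a cheap inference endpoint.
    return f"[small-model answer to: {query}]"


def call_large_model(query: str) -> str:
    # Hypothetical wrapper around an expensive, stronger endpoint.
    return f"[large-model answer to: {query}]"


def route(query: str, threshold: float = 0.5) -> str:
    """Send easy queries to the small model, hard ones to the large model."""
    if score_difficulty(query) < threshold:
        return call_small_model(query)
    return call_large_model(query)


if __name__ == "__main__":
    print(route("What is the capital of Malaysia?"))
    print(route("Prove that the sum of two even integers is even, step by step."))
```

The economics only work if the router itself is far cheaper than the large model, so the skill classifier's cost must stay negligible relative to the inference savings.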

View detailed project topics →

Who Can Apply

There are no formal eligibility requirements or age limits. We're a growth-first program and value potential, motivation, and effort more than credentials.

We welcome anyone who meets at least one of the following criteria and shares our vision:

What Increases Your Chances

What You’ll Gain

Check out the previous batch's publications!

Application Process

Program Schedule

Publications are encouraged when ready, not tied to specific conference deadlines.

Who You’ll Work With

Primary Mentors

Alham Fikri Aji
MBZUAI, Asst. Prof.

Samuel Cahyawijaya
Cohere, Member of Technical Staff

Ekapol Chuangsuwanich
Chulalongkorn University

Fajri Koto
MBZUAI, Asst. Prof.

Peerat Limkonchotiwat
AI Singapore, Research Fellow

Genta Indra Winata
Capital One AI Foundation, Senior Applied Scientist

Secondary Mentors

Farid Adilazuarda
University of Edinburgh, PhD Student

M. Dehan Al-Kautsar
MBZUAI, NLP Researcher

David Anugraha
Stanford University, MS Student

Patomporn Payoungkhamdee
VISTEC, PhD Student

M. Reza Qorib
NUS, Research Fellow

Pume Tuchinda
VISTEC, Research Assistant

Organizers & Research Managers

Aye Hninn Khine
KMUTT, SEACrowd

Holy Lovenia
SEACrowd

Past Apprentice Research

Teams

Language Surgery in Multilingual Large Language Models

Mentors: Samuel Cahyawijaya, Genta Indra Winata, Fajri Koto, Peerat Limkonchotiwat, Alham Fikri Aji
Mentees: Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali

Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations

Mentors: Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
Mentees: Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady

SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages

Mentors: Genta Indra Winata, Ekapol Chuangsuwanich, Alham Fikri Aji, Fajri Koto, Peerat Limkonchotiwat
Mentees: Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi

Publications

  1. Language Surgery in Multilingual Large Language Models
    Joanito Agili Lopo, Muhammad Ravi Shulthan Habibi, Tack Hwa Wong, Muhammad Ilham Ghozali, Fajri Koto, Genta Indra Winata, Peerat Limkonchotiwat, Alham Fikri Aji, Samuel Cahyawijaya
    Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) • 2025
    Abstract

    Large Language Models (LLMs) have demonstrated remarkable generalization capabilities across tasks and languages, revolutionizing natural language processing. This paper investigates the naturally emerging representation alignment in LLMs, particularly in the middle layers, and its implications for disentangling language-specific and language-agnostic information. We empirically confirm the existence of this alignment, analyze its behavior in comparison to explicitly designed alignment models, and demonstrate its potential for language-specific manipulation without semantic degradation. Building on these findings, we propose Inference-Time Language Control (ITLC), a novel method that leverages latent injection to enable precise cross-lingual language control and mitigate language confusion in LLMs. Our experiments highlight ITLC’s strong cross-lingual control capabilities while preserving semantic integrity in target languages. Furthermore, we demonstrate its effectiveness in alleviating the cross-lingual language confusion problem, which persists even in current large-scale LLMs, leading to inconsistent language generation. This work advances our understanding of representation alignment in LLMs and introduces a practical solution for enhancing their monolingual and cross-lingual performance.

    DOI
  2. Entropy2Vec: Crosslingual Language Modeling Entropy as End-to-End Learnable Language Representations
    Patrick Amadeus Irawan, Ryandito Diandaru, Belati Jagad Bintang Syuhada, Randy Zakya Suchrady, Alham Fikri Aji, Genta Indra Winata, Fajri Koto, Samuel Cahyawijaya
    Proceedings of the 5th Workshop on Multilingual Representation Learning (MRL 2025) • 2025
    Abstract

    We introduce Entropy2Vec, a novel framework for deriving cross-lingual language representations by leveraging the entropy of monolingual language models. Unlike traditional typological inventories that suffer from feature sparsity and static snapshots, Entropy2Vec uses the inherent uncertainty in language models to capture typological relationships between languages. By training a language model on a single language, we hypothesize that the entropy of its predictions reflects its structural similarity to other languages: Low entropy indicates high similarity, while high entropy suggests greater divergence. This approach yields dense, non-sparse language embeddings that are adaptable to different timeframes and free from missing values. Empirical evaluations demonstrate that Entropy2Vec embeddings align with established typological categories and achieved competitive performance in downstream multilingual NLP tasks, such as those addressed by the LinguAlchemy framework.

    DOI
  3. SEADialogues: A Multilingual Culturally Grounded Multi-turn Dialogue Dataset on Southeast Asian Languages
    Muhammad Dehan Al Kautsar, Aswin Candra, Muhammad Alif Al Hakim, Maxalmina Satria Kahfi, Fajri Koto, Alham Fikri Aji, Peerat Limkonchotiwat, Ekapol Chuangsuwanich, Genta Indra Winata
    Preprint • 2025
    Abstract

    Although numerous datasets have been developed to support dialogue systems, most existing chit-chat datasets overlook the cultural nuances inherent in natural human conversations. To address this gap, we introduce SEADialogues, a culturally grounded dialogue dataset centered on Southeast Asia, a region with over 700 million people and immense cultural diversity. Our dataset features dialogues in eight languages from six Southeast Asian countries, many of which are low-resource despite having sizable speaker populations. To enhance cultural relevance and personalization, each dialogue includes persona attributes and two culturally grounded topics that reflect everyday life in the respective communities. Furthermore, we release a multi-turn dialogue dataset to advance research on culturally aware and human-centric large language models, including conversational dialogue agents.

    DOI • PDF • Resources
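
For readers curious what the latent-injection idea behind Inference-Time Language Control (publication 1 above) can look like in code, below is a minimal sketch, assuming a small open model and a crude "language vector" built from a single sentence pair: a steering vector is added to the middle-layer hidden states via a forward hook. The model choice, layer index, and vector construction are all illustrative assumptions; the paper's actual procedure is in the publication above.

```python
# Rough sketch of inference-time latent injection at a middle layer, in
# the spirit of ITLC. Model, layer index, and the steering vector below
# are illustrative assumptions; see the paper (DOI above) for the method.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any decoder-only LM with .transformer.h works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

mid = len(model.transformer.h) // 2  # a "middle" layer


@torch.no_grad()
def mean_hidden(texts, layer):
    """Mean hidden state of `layer` over a few example sentences."""
    vecs = []
    for t in texts:
        out = model(**tok(t, return_tensors="pt"), output_hidden_states=True)
        vecs.append(out.hidden_states[layer].mean(dim=1))
    return torch.cat(vecs).mean(dim=0)


# Crude "language vector": difference of mean mid-layer states between a
# target-language and a source-language sentence (one pair, for illustration).
v_lang = mean_hidden(["Selamat pagi, apa kabar?"], mid) \
       - mean_hidden(["Good morning, how are you?"], mid)


def inject(_module, _inputs, output):
    """Forward hook: push hidden states toward the target language."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + v_lang
    return (hidden,) + output[1:] if isinstance(output, tuple) else hidden


handle = model.transformer.h[mid].register_forward_hook(inject)
ids = tok("The weather today is", return_tensors="pt")
print(tok.decode(model.generate(**ids, max_new_tokens=20)[0]))
handle.remove()
```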
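The Entropy2Vec construction (publication 2 above) can likewise be sketched with toy components. In the sketch below, a character-bigram model stands in for each monolingual LM, and a language's embedding is the vector of mean predictive entropies that every such model assigns to its text. Only the shape of the procedure follows the abstract; the tiny corpora and the bigram "LM" are placeholders.

```python
# Toy sketch of the Entropy2Vec idea: represent each language by the
# predictive entropies that monolingual "LMs" assign to its text. The
# character-bigram model is a stand-in for the paper's real LMs.
import math
from collections import Counter, defaultdict


def train_bigram_lm(corpus: str):
    """Character-bigram counts as a minimal monolingual 'LM'."""
    counts = defaultdict(Counter)
    for a, b in zip(corpus, corpus[1:]):
        counts[a][b] += 1
    return counts


def mean_entropy(lm, text: str) -> float:
    """Average next-character entropy of `lm` on `text` (add-one smoothed)."""
    vocab = set(text) | {c for ctx in lm for c in lm[ctx]}
    total = 0.0
    for a in text:
        dist = lm.get(a, Counter())
        z = sum(dist.values()) + len(vocab)
        probs = [(dist[c] + 1) / z for c in vocab]
        total += -sum(p * math.log2(p) for p in probs)
    return total / max(len(text), 1)


corpora = {  # tiny illustrative snippets, not real training data
    "id": "selamat pagi apa kabar semoga harimu menyenangkan",
    "ms": "selamat petang apa khabar semoga hari anda baik",
    "th": "sawasdee krap sabai dee mai wan nee akat dee",
}
lms = {lang: train_bigram_lm(text) for lang, text in corpora.items()}

# Embedding of language j: entropies of every monolingual LM on j's text.
# Low entropy against LM i suggests structural similarity to language i.
embeddings = {
    j: [mean_entropy(lms[i], corpora[j]) for i in sorted(lms)]
    for j in sorted(corpora)
}
for lang, vec in embeddings.items():
    print(lang, [round(v, 2) for v in vec])
```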