AI Singapore (AISG) and Google Research have embarked on Project Southeast Asian Languages in One Network Data (SEALD), a research collaboration to enhance datasets that can be used to train, fine-tune, and evaluate large language models (LLMs) in languages spoken across Southeast Asia (SEA).

This collaboration seeks to improve cultural context awareness and capabilities in SEA LLMs, and advance their applicability across the region to bring broad benefits to society, the duo said in a statement on Monday.

Starting with Indonesian, Thai, Tamil, Filipino, and Burmese, the research under Project SEALD will help build a diverse, high-quality data corpus of languages spoken in SEA. The corpus will support the training of models under Southeast Asian Languages in One Network (SEA-LION), an AISG initiative to develop a family of LLMs specifically pre-trained and instruction-tuned to be more representative of SEA’s cultural contexts and linguistic nuances, as well as other models that can add value to SEA-centric use cases.

Under Project SEALD, AISG and Google Research Asia Pacific (APAC) will work together on developing translocalization and translation models; establishing best practices for instruction-tuning datasets; creating tools to enable translocalization at scale; and publishing pre-training recipes for SEA languages.

AISG and Google will release the datasets and output from Project SEALD as open source to advance the progress of the SEA LLM ecosystem and foster strong regional expertise.

As a specific use case, Project SEALD is working to improve communications with under-represented populations of migrant workers in Singapore, who may speak and understand a variety of regional languages with greater fluency than English.

Data collection efforts to better capture linguistic nuances within this community will provide the foundation for enhanced engagement by both the Singapore Government and employers.

When integrated into one of the generative artificial intelligence (AI) solutions first developed under the AI Trailblazers initiative by the Singapore Government and Google Cloud, the datasets and output from Project SEALD can aid outreach across a variety of important domains, such as redressal of worker grievances and extension of assistance schemes.

Lastly, Project SEALD will engage with ecosystem partners—academia, industry, and government—in various ways.

These include working with industry players for data collection, curation, and quality checks, collaborating with academia in different SEA countries to implement state-of-the-art techniques in evaluation and benchmarking, and partnering with government stakeholders in Singapore and across the region to advance use cases for public good.

Building on this, AISG is collaborating with Google Cloud to make its SEA-LION LLMs available on Google Cloud’s Model Garden on Vertex AI, which provides organizations with access to first-party, third-party, and open models that meet Google Cloud’s strict enterprise safety and quality standards.

Through Vertex AI, organizations can use enterprise-grade tools to easily customize these models to address relevant use cases and integrate them into their applications.

In addition, AISG will continue to make its SEA-LION LLMs available on Hugging Face, which has been partnering with Google Cloud to help developers train, tune, and serve open models quickly and cost-effectively.

AISG has also initiated collaborations across Singapore and other SEA countries.

“By focusing on languages spoken and used in SEA and cultural understanding, Project SEALD will significantly improve the existing corpus and evaluation benchmarks for these languages.

“This will open new opportunities and make AI more inclusive, accessible, and helpful for individuals and businesses throughout the region,” said Yolyn Ang, Vice President, Knowledge and Information Partnerships, Google APAC.

Elsewhere in APAC, Google Research has a similar large-scale language inclusivity project under way in India with the Indian Institute of Science: Project Vaani, an initiative that is gathering, transcribing, and open-sourcing speech data from across all of India’s 773 districts.

“The SEA-LION LLM project has always been about building a community and ecosystem that will continuously work together to enhance the quality of the SEA-LION data corpus and continuously improve SEA-LION’s capabilities.

“We are happy that Google now stands as a key part of the SEA-LION ecosystem and we look forward to building better datasets through Project SEALD in collaboration with Google for the benefit of the entire community,” said Leslie Teo, Senior Director of AI Products, AISG.

AISG is a national program launched by the National Research Foundation (NRF) to catalyze, synergize, and boost Singapore’s AI capabilities to power the country’s future digital economy.

It is driven by a government-wide partnership comprising NRF, the Smart Nation and Digital Government Office (SNDGO), the Infocomm Media Development Authority (IMDA), the Economic Development Board (EDB), and Enterprise Singapore (EnterpriseSG), among others.
