LILA Lab — Language Intelligence for Low-resource Applications

Languages We Serve

Every language deserves to be visible in the data that shapes global decisions. We're building infrastructure for ten underserved languages — and counting.

বাংলা

Bangla

265M speakers · BENI pipeline active

Active

অসমীয়া

Assamese

15M speakers

Seeking Contributors

नेपाली

Nepali

25M speakers

Seeking Contributors

ꠍꠤꠟꠐꠤ

Sylheti

11M speakers

Seeking Contributors

চাঁটগাঁইয়া

Chittagonian

16M speakers

Seeking Contributors

هَوُسَ

Hausa

80M speakers

Planned

Kiswahili

Swahili

100M speakers

Planned

Tiếng Việt

Vietnamese

100M speakers

Planned

Tagalog

Filipino

80M speakers

Planned

Bahasa Indonesia

Indonesian

200M speakers

Planned

🪶

Your Language?

We're ready when you are

Be First

Start your pipeline →

The Lab in Numbers

Proven in Bangla — scaling to ten languages by 2027

10

Languages targeted

664K+

Articles processed & classified

91.7%

Classification accuracy (TF-IDF)

6

Research papers in series

r=−0.75

CPI correlation (p < 0.001, validated)

How XENI Works

Every language gets its own XENI: [X] Exploration & Native-language Intelligence. The first letter changes per language. Proven in Bangla as BENI.

1

📰

Collect Native News

Aggregate millions of native-language articles from local sources, archives, and feeds — preserving linguistic authenticity from day one.

2

🤖

Annotate & Classify

An LLM ensemble (Claude, GPT-4o) annotates narratives across domains. The TF-IDF baseline classifier reaches 91.7% accuracy on economic narratives.

3

📊

Build Validated Index

Construct monthly narrative indices validated against macroeconomic indicators. BENI's Economic Index correlates with CPI at r=−0.75 (p < 0.001).
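Step 3 can be illustrated with a minimal sketch. It assumes the index is the monthly share of articles carrying a given narrative label, and that validation is a Pearson correlation against a CPI series. The article records, the `economic_stress` label, the function names, and the CPI values are all hypothetical; the actual BENI index construction may differ.

```python
from collections import defaultdict
from math import sqrt

def monthly_index(articles, label):
    """Share of each month's articles that carry `label` (assumed index definition)."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for month, article_label in articles:
        totals[month] += 1
        if article_label == label:
            hits[month] += 1
    return {month: hits[month] / totals[month] for month in sorted(totals)}

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Toy data (made up): (month, predicted label) pairs and a CPI series.
articles = [
    ("2024-01", "economic_stress"), ("2024-01", "other"),
    ("2024-01", "other"), ("2024-01", "other"),
    ("2024-02", "economic_stress"), ("2024-02", "other"),
    ("2024-03", "economic_stress"), ("2024-03", "economic_stress"),
    ("2024-03", "economic_stress"), ("2024-03", "other"),
]
cpi = {"2024-01": 9.9, "2024-02": 9.5, "2024-03": 9.0}

index = monthly_index(articles, "economic_stress")
months = sorted(index)
r = pearson_r([index[m] for m in months], [cpi[m] for m in months])
print(index)
print(round(r, 3))
```

In this toy series the index rises while CPI falls, so the coefficient is negative. A real validation would also need a significance test and a far longer series than three months.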

The XENI Naming System

The pattern teaches itself. First letter = language. XENI = pipeline. Domain = index.

BENI = Bangla

AENI = Assamese

NENI = Nepali

YENI = Yoruba

?ENI = Your language

Your language's initial + ENI = your pipeline →

The ENI suffix always marks the pipeline; the prefix identifies the language. The domain is a plain-English qualifier on the index it produces, e.g., BENI Economic Index, BENI Health Index, BENI Climate Index. Same pipeline, many indices per language.
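The naming convention is simple enough to express as two small helpers. This is only an illustrative sketch: `pipeline_name` and `index_name` are hypothetical names, not functions from the LILA Lab repository.

```python
def pipeline_name(prefix: str) -> str:
    """Build a pipeline name from a language prefix, e.g. 'B' -> 'BENI'."""
    cleaned = prefix.strip().upper()
    if not cleaned.isalpha():
        raise ValueError(f"prefix must be alphabetic, got {prefix!r}")
    return cleaned + "ENI"

def index_name(prefix: str, domain: str) -> str:
    """Name an index as '<pipeline> <Domain> Index'."""
    return f"{pipeline_name(prefix)} {domain.strip().title()} Index"

print(pipeline_name("B"))          # BENI
print(index_name("B", "health"))   # BENI Health Index
print(index_name("A", "Climate"))  # AENI Climate Index
```

Keeping the domain out of the pipeline name means a new index such as a Health or Climate index adds a qualifier rather than a new pipeline.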

The XENI Pipeline Family

Each pipeline is named [language prefix] + ENI: a single initial for most languages, or a two-letter prefix as in KIENI, VIENI, TIENI, and IDENI. Proven in Bangla, ready for your language.

B

BENI — Bangla Exploration & Native-language Intelligence

Active · 265M speakers

The first complete XENI pipeline. From raw Bangla news to a validated macroeconomic narrative index — proven against CPI, FX, and foreign reserves. Open-source, reproducible, and ready to adapt for your language.

664K+

Articles classified

91.7%

Accuracy (TF-IDF)

r=−0.75

CPI correlation

79 mo.

Monthly index span

Explore BENI on GitHub →

A

AENI

অসমীয়া — Assamese

Seeking Contributors

N

NENI

नेपाली — Nepali

Seeking Contributors

S

SENI

ꠍꠤꠟꠐꠤ — Sylheti

Seeking Contributors

C

CENI

চাঁটগাঁইয়া — Chittagonian

Seeking Contributors

H

HENI

هَوُسَ — Hausa

Planned

KI

KIENI

Kiswahili — Swahili

Planned

VI

VIENI

Tiếng Việt — Vietnamese

Planned

TI

TIENI

Tagalog — Filipino

Planned

ID

IDENI

Bahasa Indonesia

Planned

🪶

[?]ENI — Your Language Here

The pipeline is language-agnostic. If you speak it, we can process it. Fork the repo, adapt the template, publish the paper.

Start Your XENI →

LILA Technical Reports

A 6-paper research program on narrative measurement across underserved languages — from statistical foundations to LLM-based measurement devices.

#1

Statistical Economics of Narrative

A quantitative framework for narrative-based economic analysis. Foundations of the methodology.

Complete

#2

Systematic Review of Economic Narrative Indices

Systematic review, replication study, and Bangla extension (2007–2025).

Submitted to arXiv

#3

Building BENI: A Replicable Pipeline

From raw news to validated narrative index — the complete methodology and technical architecture.

Active (Jul 2026)

#4

Nowcasting Inflation with BENI

Local-language news as a high-frequency economic indicator for inflation prediction.

Planned (Aug 2026)

#5

Text as Data in Social Science

110-year survey of language-based methods: from content analysis to LLMs.

Planned (Oct 2026)

#6

LLMs as Measurement Devices

Framework for narrative extraction and measurement in low-resource languages.

Proposed (Jan 2027)

Browse All Reports →

Eight Ways to Contribute

Every contribution model leads to academic credit, from acknowledgement in papers to first authorship. If you speak an underserved language, you are not a data source; you are a co-author.

🌍

Language Extension

Apply the pipeline to YOUR language. First-author paper.

🔬

Cross-Domain

Apply to health, climate, education. First-author paper.

⚙️

Methodological

Improve the classifier, reduce cost. Co-authorship.

✅

Replication

Independently verify results. Published replication report.

🗣️

Citizen Annotation

Label articles in your language. Acknowledgement in papers.

📊

Policy Brief

Analyze narratives for policy. Co-authorship.

🛠️

Infrastructure

Build dashboards, APIs, tools. Tool paper co-authorship.

📖

Education

Create tutorials, course modules. Educational paper co-authorship.

Read the Full Framework →

Join the Lab

A contributor knowledge base spanning economics, linguistics, and Git — with a clear roadmap for getting involved.

📈

Economic Foundations

Why narratives matter for economic measurement and policy.

Narrative Economics (Shiller, 2017) — how stories spread like viruses and drive economic fluctuations. The theoretical basis for why we track narrative prevalence in news.
Text as Data (Gentzkow, Kelly & Taddy, 2019) — converting unstructured text into quantitative measures. From bag-of-words to LLM-based embeddings.
Nowcasting — using high-frequency non-traditional data (news, search trends) to predict official statistics before they're released.
Economic Complexity & Fingerprint — each economy leaves a distinctive narrative fingerprint. The BENI approach captures that fingerprint through domain-specific narrative classification.
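The "text as data" idea above can be illustrated with a minimal TF-IDF sketch in plain Python. It assumes whitespace tokenization, term frequency as a share of the document's tokens, and a smoothed IDF of ln((1 + N) / (1 + df)) + 1. The sample headlines are made up, and the tokenizer and weighting actually used in BENI are not specified here.

```python
from collections import Counter
from math import log

def tfidf(documents):
    """Return one {term: weight} dict per document using TF * smoothed IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens))
            * (log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

# Made-up headlines standing in for native-language articles.
docs = [
    "rice prices rise again",
    "rice export ban lifted",
    "cricket team wins series",
]
vectors = tfidf(docs)
print(vectors[0])
```

Because "rice" appears in two of the three headlines, it receives a lower weight than "prices", which appears in only one. That down-weighting of common terms is what lets a simple linear classifier pick up on distinctive vocabulary.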

🗣️

Linguistics & NLP

How we process low-resource languages at scale.

Low-Resource NLP — languages with limited labeled data, tools, and pre-trained models. 7,000+ languages worldwide; fewer than 100 have any NLP support.
Multilingual Transformers — mBERT, XLM-R, BanglaBERT, sahajBERT. Cross-lingual transfer learning enables progress where direct data is scarce.
Annotation Theory — schema design, inter-annotator agreement, adjudication. Our LLM ensemble (Claude + GPT-4o) achieves human-level reliability at 5–20× lower cost.
Script & Tokenization — Bangla script (Bengali-Assamese), Sylheti Nagri, Devanagari, Arabic script variations. Each writing system presents unique tokenization challenges for transformer models.
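Inter-annotator agreement, mentioned under Annotation Theory, is often summarized with Cohen's kappa, which corrects raw agreement for agreement expected by chance. The sketch below computes it for two annotators in plain Python. The label lists are made up, and the source does not state which agreement statistic LILA Lab reports.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Chance agreement: both annotators pick the same label independently.
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical labels from two annotators (for example, two models).
model_a = ["economy", "economy", "health", "other", "economy", "health"]
model_b = ["economy", "health", "health", "other", "economy", "health"]
kappa = cohens_kappa(model_a, model_b)
print(round(kappa, 3))  # 0.739
```

Here the annotators agree on 5 of 6 items, but roughly 36% agreement would be expected by chance, so kappa is lower than the raw agreement rate.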

⌨️

Git & GitHub Management

How we organize, version, and collaborate across pipelines.

Monorepo Structure — all pipelines, datasets, and documentation in one repository. Every XENI pipeline follows the same directory template for discoverability.
Branch Strategy — feature branches from main, CI/CD via GitHub Actions, auto-deploy to GitHub Pages. Pre-commit hooks enforce linting and formatting.
Issue & Project Boards — each pipeline tracked as a GitHub Project. Labels: language/*, domain/*, paper/*, good first issue.
Contribution Model — fork & PR workflow. All contributors credited in the registry. Academic co-authorship guaranteed for substantive contributions.

🔍 Looking For

Domain Experts

We're actively seeking collaborators with domain expertise. No NLP background needed — we handle the technical pipeline. You bring the subject-matter knowledge.

📈 Economists: Validate narrative indices against macroeconomic indicators

🌍 Linguists: Design annotation schemas for low-resource languages

🏥 Health Specialists: Build the BENI Health Index with domain-specific labels

🌿 Climate Researchers: Define climate narrative categories for emerging economies

📊 Social Scientists: Design validation frameworks and policy briefs

Reach Out →

Project Roadmap

LILA Lab builds toward 10 underserved languages by H1 2027. Here is the path.

Complete

BENI Pipeline & Baseline Papers

664K Bangla articles collected, annotated (Claude + GPT-4o ensemble), and classified. TF-IDF baseline: 91.7% accuracy. BENI Economic Index validated against CPI (r = −0.75, p < 0.001). Papers #1 and #2 complete, #3 active.

Active

Unified Corpus & 6-Model Benchmark

Unified 933K-article corpus built. TF-IDF on unified corpus: 94.77% accuracy. Six BanglaBERT models queued for Kaggle GPU training. Paper #3 (Building BENI) in progress.

Q3 2026

Multi-Domain Expansion

Extend BENI beyond Economics: Health (BENI Health Index), Climate (BENI Climate Index), Education (BENI Education Index). Each domain needs annotation schema design and validation data. Associated papers: #4 (Nowcasting) and #5 (Text as Data survey).

Q4 2026

Sister Pipelines: AENI, NENI, SENI, CENI

Bootstrap pipelines for Assamese, Nepali, Sylheti, and Chittagonian. Each requires native-speaking contributors, dataset collection, annotation schema adaptation, and local validation. GitHub Project boards created for each.

H1 2027

African & Southeast Asian Expansion

HENI (Hausa, 80M speakers), KIENI (Swahili, 100M), VIENI (Vietnamese, 100M), TIENI (Filipino, 80M), IDENI (Indonesian, 200M). Target: 10-language XENI family complete. Associated paper: #6 (LLMs as Measurement Devices).

First Steps

Start contributing today — no prior NLP experience required.

1

Fork the Repository

Fork LilaLABx/LILA-LAB on GitHub, clone locally, and run pip install -e ".[all]". Read the Contributing Guide for environment setup.

2

Pick an Entry Point

Browse the good first issues or choose a language pipeline that matches your background. Linguists can contribute annotation schemas; developers can build infrastructure; economists can design validation frameworks.

3

Read the Knowledge Base

Study the three pillars above. The technical reports provide full methodological depth. The language registry tracks all pipeline statuses.

4

Join the Community

Introduce yourself on Discord. Tell us your language, your domain interest, and how you'd like to contribute. Every contributor — technical or not — is credited in our registry.

Fork the Repo → Join Discord

84% of NLP research is English-only. If your language isn't served, you're invisible in the data that shapes global decisions. We change that — one pipeline at a time.

— LILA Lab

Connect with LILA Lab

Follow LILA Lab across platforms — all coordinated from the repository. Join the movement for language infrastructure.

🐦 X / Twitter: @LILA_Lab
💼 LinkedIn: /company/lila-lab
▶️ YouTube: @LILA_Lab
💬 Discord: discord.gg/TrrdKbky
📰 Substack: lila.substack.com
⌨️ GitHub: LilaLABx/LILA-LAB
📄 OSF: Open Science Framework
🤗 Hugging Face: nabil0x
👤 Facebook: LILALabResearch

All channels are documented and coordinated from the Communications Center.