✅ Proven Pipeline · 🔤 Bangla (265M speakers) · 📊 79-month index

Bangla Exploration & Native-language Intelligence

The first proven XENI pipeline: from 664,000 Bangla news articles to a validated monthly narrative index that correlates with macroeconomic indicators.

91.7% classification accuracy · 79 monthly observations · r = −0.75 correlation with CPI · 2 papers published

What is BENI?

BENI is the first end-to-end XENI pipeline: an instrument that measures economic narratives in the Bangla language by turning news articles into quantitative data.

The Bangla Narrative Observatory

BENI collects news articles from six major Bangladeshi newspapers, annotates them for economic relevance using an LLM ensemble (Claude, GPT-4o), trains a classifier, and constructs a monthly narrative index.

The index is then validated against real-world macroeconomic indicators (CPI inflation, exchange rates, and foreign exchange reserves), showing that narrative measurements capture economically meaningful signal.

Bangla · Exploration · Native-language · Intelligence
The first of many XENI pipelines targeting 10 emerging-economy languages by 2027
664K
Bangla news articles collected
6
Major newspapers sourced
2014–2020
Full date range of the corpus
265M
Bangla speakers served

How BENI works

Six stages turn raw Bangla news into a validated economic narrative index.

1
Article Collection
Contributors submit Bangla news URLs, forms, or corpus files. The Potrika corpus provides 664K articles from six major dailies: Jugantor, Ittefaq, Kaler Kontho, Inqilab, Jaijaidin, and Somoyer Alo.
39 CSV files · 3.3 GB
2
Bucket Building
Articles are assembled into a 120,707-article time series (Economy plus sampled National/Politics/Worldnews), preserving publication dates for time-series-aware evaluation.
120K articles processed
3
LLM Annotation
Each article is independently annotated by Claude and GPT-4o. Disagreements are resolved via majority voting (adjudication). The ensemble approach improves reliability and provides natural confidence estimates.
Claude + GPT-4o ensemble
4
Human Review
Native speakers verify uncertain or borderline labels. A 300-article locked reference set serves as the gold standard for all future classifier evaluation.
300-article reference set
5
Classification & Index
A TF-IDF + logistic regression classifier (91.7% accuracy, 0.894 macro F1) predicts economic relevance for every article. Monthly aggregation produces the BENI Economic Index: 79 months of narrative data.
91.7% accuracy
6
Macroeconomic Validation
The index is tested against CPI inflation (r = −0.75), BDT/USD exchange rate (r = −0.72), and foreign exchange reserves (r = −0.77), all statistically significant at p < 0.001.
p < 0.001
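As a rough illustration of the annotation and review stages above, the sketch below combines labels from several annotators by majority vote and uses the share of agreeing votes as a confidence estimate. The function name `adjudicate`, the label strings, and the rule that ties are sent to human review are assumptions made for illustration; none of them is taken from the BENI codebase.

```python
from collections import Counter


def adjudicate(labels):
    """Combine annotator labels for one article by majority vote.

    Returns (label, confidence, needs_review). Hypothetical helper,
    not part of the BENI codebase.
    """
    if not labels:
        raise ValueError("at least one label is required")
    ranked = Counter(labels).most_common()
    top_label, top_votes = ranked[0]
    # A tie for first place means there is no majority.
    tied = len(ranked) > 1 and ranked[1][1] == top_votes
    # Confidence is the share of annotators backing the top label.
    confidence = top_votes / len(labels)
    if tied:
        return None, confidence, True
    return top_label, confidence, False


# Illustrative votes (made up): two LLM annotators per article.
votes = {
    "article-1": ["economic", "economic"],
    "article-2": ["economic", "not_economic"],
}
results = {article_id: adjudicate(v) for article_id, v in votes.items()}
for article_id, (label, confidence, needs_review) in results.items():
    print(article_id, label, confidence, needs_review)
```

With only two annotators, any disagreement is a tie, so this sketch routes it to human review; a third annotator would allow a strict majority.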

Key results

What BENI has achieved so far, and what it makes possible for every language that follows.

🎯
91.7%
Classification accuracy (TF-IDF + logistic regression)
📊
79
Monthly observations in the narrative index (2014–2020)
📈
−0.75
Level correlation with CPI (p < 0.001)
💱
−0.72
Level correlation with BDT/USD FX (p < 0.001)
🏦
−0.77
Level correlation with FX reserves (p < 0.05)
📄
2
Papers published (4 more in the pipeline)

BENI Economic Index

A monthly time series measuring the share of economic news in Bangla-language media.

What it measures

The BENI Economic Index tracks what proportion of Bangla news articles discuss the economy each month. At its core, it answers a simple question: how much is Bangladesh talking about the economy?

The classifier predicts an "economic probability" for every article in the corpus. Articles are grouped by month, and the index is the proportion with probability above 0.5. The result is a 79-month time series from June 2014 to December 2020.
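The aggregation described above can be sketched in a few lines. This is a minimal illustration, assuming each article is available as a `(publication_date, economic_probability)` pair; the function name `monthly_index` and the sample values are hypothetical and are not taken from `build_index.py`.

```python
from collections import defaultdict
from datetime import date


def monthly_index(articles, threshold=0.5):
    """Share of articles per month whose probability exceeds the threshold.

    `articles` is an iterable of (publication_date, economic_probability).
    Hypothetical helper for illustration only.
    """
    totals = defaultdict(int)
    economic = defaultdict(int)
    for published, probability in articles:
        month = (published.year, published.month)
        totals[month] += 1
        if probability > threshold:
            economic[month] += 1
    # One index value per month, in chronological order.
    return {month: economic[month] / totals[month] for month in sorted(totals)}


# Made-up probabilities for illustration.
sample = [
    (date(2014, 6, 3), 0.91),
    (date(2014, 6, 17), 0.12),
    (date(2014, 7, 1), 0.77),
    (date(2014, 7, 9), 0.64),
    (date(2014, 7, 30), 0.08),
    (date(2014, 7, 31), 0.30),
]
index = monthly_index(sample)
print(index)
```

Keying months by `(year, month)` keeps the series sorted chronologically without any date parsing.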

Mean economic news share: 38.9%, meaning roughly 2 in 5 Bangla news articles carry economic relevance.

Explore the pilot experiment →

38.9%
Mean economic news share
79 mo.
Index duration (Jun 2014 – Dec 2020)
120,707
Articles classified for the timeseries
0.894
Macro F1 score (TF-IDF classifier)
On correlations: Level correlations are strong and significant (r = −0.75 with CPI, −0.72 with FX). Month-to-month (detrended) correlations are near zero, suggesting the TF-IDF index captures long-run structural shifts, not short-term noise. The planned BanglaBERT upgrade may improve short-run signal.
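The difference between a level correlation and a month-to-month correlation can be sketched as follows. The code computes a Pearson correlation on the raw series and again on first differences, which is one common way to remove a trend. The source does not say which detrending method BENI uses, so that choice, the function names, and the sample numbers are all assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def first_difference(series):
    """Month-to-month changes: series[t] - series[t - 1]."""
    return [later - earlier for earlier, later in zip(series, series[1:])]


# Made-up monthly values for illustration (not BENI data).
narrative_index = [40, 39, 41, 38, 39, 36, 37, 35]
cpi = [100, 102, 103, 105, 106, 108, 110, 111]

level_r = pearson(narrative_index, cpi)
change_r = pearson(first_difference(narrative_index), first_difference(cpi))
print(f"level r = {level_r:.2f}, month-to-month r = {change_r:.2f}")
```

Two series that both trend over time can show a strong level correlation even when their monthly changes are unrelated, which is why the two numbers are worth reporting side by side.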

Run BENI yourself

Clone the repo and be up and running in minutes.

🔬

For Researchers

Train the baseline classifier and build the 79-month narrative index from scratch.

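# Clone the repository and install the core dependencies
git clone https://github.com/LilaLABx/LILA-LAB.git
cd LILA-LAB
pip install -e ".[core]"

# Train the TF-IDF baseline, build the index, and run the correlations
cd pipelines/BENI/experiment/beni_pilot
python3 train.py --task economic --model-type tfidf
python3 build_index.py
python3 correlate.py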
💻

For Developers

Explore the full annotation pipeline, run LLM annotation, or fine-tune BanglaBERT.

cd pipelines/BENI/annotation
python3 llm_annotate.py --help
python3 run_model_comparison.py --help

# Set up the Discord bot
cd infrastructure/discord-bot
pip install -r requirements.txt
python bot.py
🗣️

For Linguistic Contributors

No coding required. Help annotate Bangla articles, design schemas, or validate LLM outputs.

# Read the contribution guide
cat docs/LINGUISTIC_CONTRIBUTION_GUIDE.md

# Join the community
# discord.gg/TrrdKbky

BENI milestones

Where BENI has been and where it's going.

Pilot Complete

Complete

TF-IDF baseline trained, 79-month index built, correlations with CPI and FX validated. Proof of concept established.

Papers 1 & 2 Published

Complete

Statistical Economics of Narrative and Economic Narrative Indices: Systematic Review, both complete and submitted.

Paper 3: BENI Method

Active

Building the full BENI pipeline paper. Documents the annotation methodology, classifier training, and index construction.

Paper 4: Nowcasting with BENI

Planned

Using the narrative index to nowcast inflation. Scheduled for August 2026.

Domain Expansion

Planned

Extend BENI to Health, Climate, and Education domains. Each domain needs its own annotation schema and validation data.

BanglaBERT Upgrade

Planned

Full fine-tuning of BanglaBERT on the 70K-article training set (Kaggle GPU). Expected to improve short-run signal and classification accuracy.

Papers & publications

BENI is the engine behind a 6-paper research series. Here's where its data and methodology appear.

Help build BENI

Multiple ways to contribute, with no coding required for linguistic contributions.

🗣️

Linguistic Contributor

Annotate Bangla articles, design schemas, review LLM outputs. Your language expertise is the core ingredient.

Get started →
⚙️

Methodological

Improve the classifier, reduce LLM costs, or design better validation approaches.

Collaboration →

Cross-Domain

Extend BENI to Health, Climate, or Education domains. New annotation schema = new index.

Propose a domain →
✅

Replication

Independently verify BENI's results using the open-source code and published data.

Replicate →
📊

Policy Brief

Analyze BENI's narrative data for policy insights and real-world applications.

Explore data →
🛠️

Infrastructure

Build dashboards, APIs, or visualization tools for the BENI index.

Code →

Ready to explore BENI?

Clone the repository, run the pipeline, or join the community.