About

I'm a third-year PhD student at the Center for Language and Speech Processing at Johns Hopkins University, advised by Professor David Yarowsky. In the past, I worked at ALMAnaCH, INRIA in Paris as a Research Engineer with Benoît Sagot and Rachel Bawden. Even before that, I graduated from the EMLCT Master's program as an Erasmus scholar, with a dual MSc in Computational Linguistics and NLP at Charles University, Prague (first year) and Saarland University, Saarbrücken (second year). I'm interested in building NLP tools for text and speech that are available for all the world's languages in their dialectal, colloquial, and code-switched variants :)

Publications

See ACL Anthology or Google Scholar.

News


CV

Here's a PDF version of all of this stuff.

Research Interests

There are 3800+ written languages in the world, with varying levels of resourcedness. Given the LLM paradigm that powers everything these days, making NLP massively multilingual has two broad facets: enabling LLMs to comprehend content and instructions in a low-resource language (LRL), and teaching them to generate accurate, useful, and fluent content in that language. Here are some of the kinds of problems in this space:

Answering the (age-old) cascade question: At the frontier of NLP today, we're able to do quasi-magical things in a few high-resource languages, especially English. For the rest of the languages in the world, our tools lag behind: they are more likely to produce wrong-language text, and their responses are less accurate. Given complex problems, they are less likely to make it through a series of logical steps correctly. They are less easily controlled for toxicity and safety. What we do have for these languages is good quality machine translation (well, better quality). Instead of attempting to induce native capabilities for all the above in LLMs for the range of mid-resource languages - why not translate inputs into English, let LLMs do what they know best, and then translate English outputs back into the target language? What do we lose by this cascaded approach, what do we gain, and can we quantify these gains and losses? What does it mean for the kinds of resources we try to collect in LRLs, and for the tools we try to build for them? Should we redirect our energy towards building MT specialised in LLM outputs, instead of training LLMs in various languages?
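
To make the question concrete, here's a minimal sketch of the cascaded setup (sometimes called translate-test). Both translate and llm_generate are hypothetical placeholders for whatever MT system and LLM you actually plug in - only the wiring is the point.

```python
# Sketch of the cascaded ("translate-test") pipeline described above.
# `translate` and `llm_generate` stand in for whatever MT system and LLM
# you actually have; only the wiring is the point.

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical placeholder for an MT system."""
    raise NotImplementedError

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to an English-centric LLM."""
    raise NotImplementedError

def cascaded_answer(query: str, lang: str) -> str:
    query_en = translate(query, src=lang, tgt="eng")   # LRL -> English
    answer_en = llm_generate(query_en)                 # let the LLM work in English
    return translate(answer_en, src="eng", tgt=lang)   # English -> LRL

def native_answer(query: str) -> str:
    # The "native" baseline the cascade gets compared against:
    # prompt the LLM directly in the low-resource language.
    return llm_generate(query)
```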

Reasoning and inference-time scaling in a multilingual context: Once our LLM is pretrained, we have a number of ways of making it better at complex tasks. A desired answer may not be straightforwardly derivable from the question for multiple reasons: the task itself might be difficult to understand, it may require a few logical hops, it may require a non-trivial specialised computation, or it may require consideration of several perspectives. Strategies such as few-shot prompting with in-context learning, chain-of-thought elicitation (and its multiple variants), tool use, and multi-agent debate are methods of searching a solution space for a desired answer in response to the above problems. However, they are all worse in multilingual settings. (This by itself is somewhat interesting: if LLMs are doing all their reasoning in English, as some papers indicate they are, this shouldn't be the case.) I'm interested in understanding where this degradation comes from - divergent base-model processes, shallower fluency and articulation problems, lack of post-training exposure, or something else - and in developing solutions to mitigate it.
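
As a toy illustration of how one might start poking at this, here's a sketch that contrasts eliciting the chain of thought in the target language with reasoning in English and only answering in the target language. llm_generate and the prompts are purely hypothetical, not a real experimental setup.

```python
# Toy probe for the degradation question: elicit the chain of thought in the
# target language vs. reason in English and answer in the target language.
# `llm_generate` is a hypothetical stand-in; the prompts are illustrative only.

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call."""
    raise NotImplementedError

def cot_in_target_language(question: str, lang: str) -> str:
    prompt = (
        f"Answer the following question in {lang}. "
        "Think step by step before giving your final answer.\n\n"
        f"{question}"
    )
    return llm_generate(prompt)

def cot_in_english(question: str, lang: str) -> str:
    # One way to separate "reasoning" failures from surface fluency and
    # articulation failures: reason in English, answer in the target language.
    prompt = (
        "Think step by step in English, then give only the final answer "
        f"in {lang}.\n\nQuestion: {question}"
    )
    return llm_generate(prompt)
```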

Tool use for multilinguality: LLMs are still often bad at comprehension and generation for several mid- to low-resource languages. For these languages, we may want to leverage the reasoning capabilities of LLMs but outsource linguistic processing and generation to specialized modules. What are the best architectural and training strategies for these modules and their integration with the LLM pipeline?
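
One possible shape for this, just to make the idea concrete: the LLM handles planning and reasoning, and emits structured calls to registered language-specific modules. All the names below (analyze_morphology, TOOLS, run_tool) are hypothetical placeholders, not a real system.

```python
# Hypothetical shape for "reason with the LLM, outsource linguistic work":
# the LLM emits structured tool calls; the pipeline executes them and feeds
# results back into the context. All names below are placeholders.

from typing import Callable

def analyze_morphology(word: str, lang: str) -> str:
    """Placeholder for a dedicated morphological analyser."""
    raise NotImplementedError

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for a specialised MT module."""
    raise NotImplementedError

TOOLS: dict[str, Callable[..., str]] = {
    "morph": analyze_morphology,
    "translate": translate,
}

def run_tool(name: str, **kwargs) -> str:
    # e.g. run_tool("morph", word="ghar", lang="hin")
    return TOOLS[name](**kwargs)
```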

Better machine translation: Yep, MT is not solved. MT is a moving target even for mid-to-high resource languages, because our horizons are constantly expanding with what we expect from NLP. Today we want to do fine-grained arguments, long coherent documents, literary generation, abstract correction. Even more audaciously: we want to do MT for *all* languages - we want to take lexicons and grammar books and show them to LLMs, and ask LLMs to translate Kalamang for us. How do we evolve MT for mid-resource languages to adapt it to new horizons? How do we teach LLMs to translate into and from unseen languages? How do we make all this robust and reliable? Further, MT evaluation very quickly becomes a problem: how do you holistically evaluate MT for a language you can only do string matching in? I'm interested in exploring MT catered to the frontier of NLP, for languages of various resource levels.
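
For a concrete sense of what "string matching" means here: surface-overlap metrics like chrF and BLEU are often the only thing we can compute for a language we don't speak and have no learned metric for. A minimal sketch, assuming sacrebleu is installed; the example data is made up.

```python
# "String matching" evaluation, concretely: surface-overlap metrics computed
# against gold translations. Assumes sacrebleu; the example data is made up.

import sacrebleu

hypotheses = ["Dies ist ein Test .", "Noch ein Satz ."]   # MT outputs
references = ["Dies ist der Test .", "Noch ein Satz ."]   # gold translations

chrf = sacrebleu.corpus_chrf(hypotheses, [references])
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"chrF = {chrf.score:.1f}, BLEU = {bleu.score:.1f}")

# Neither number tells us whether the output is fluent, faithful, or safe,
# which is exactly the holistic-evaluation gap described above.
```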

(And of course) Dialectal generalisation, cross-lingual transfer and its limits: Many dialects and languages of the world exist across a continuum, with varying degrees of resourcedness at various points. Can we model this continuum in a manner that can help NLP tools fill in the gaps? How can we most effectively use its properties to leverage good datasets and models at certain points on this continuum for others? Many things about linguistic divergence are systematic, meaning that models can extrapolate performance to low-resource languages that are close enough to high-resource languages in regular ways. However, languages also have irregular, language-specific phenomena. Can we theoretically quantify the limits of cross-lingual transfer for a given language family? Can we identify the kinds of phenomena that cannot be transferred, so that we can evolve targeted solutions for them?
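
As a purely illustrative toy, here's one crude proxy for "close in regular ways": character n-gram overlap between corpora of a high-resource language and a related low-resource one. Real transfer studies use much richer signals (typological features, subword overlap, phonological distance); this just makes the continuum idea concrete, and the usage example in the comment is hypothetical.

```python
# Crude, illustrative proxy for relatedness on the dialect continuum:
# character n-gram overlap between two corpora. Not a serious transfer
# predictor, just a way to make the "continuum" idea concrete.

from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_overlap(corpus_a: str, corpus_b: str, n: int = 3) -> float:
    a, b = char_ngrams(corpus_a, n), char_ngrams(corpus_b, n)
    shared = sum((a & b).values())
    union = sum((a | b).values())
    return shared / union if union else 0.0

# Hypothetical usage: compare ngram_overlap(hindi, bhojpuri) with
# ngram_overlap(hindi, tamil) and see whether higher overlap tracks
# better zero-shot transfer.
```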

I also have a particular interest in NLP for Indian languages! Also - besides all this stuff, people seem to be talking a fair bit about agentic stuff these days. I should figure out what's up with that.

For Fun...

When I'm not working, I enjoy playing tennis, dancing WCS, salsa, and bachata, solving cryptic crosswords, reading, and writing. I'm currently learning Telugu, and trying not to forget my French. Putting that here for accountability. I also enjoy the occasional game of rapid chess, and am a fan of the Sicilian Dragon.

Other fun things about me: I played basketball for my city as a teenager, my FIDE classical chess rating is 1609, and I spent the summer of 2019 translating a book from Hindi to English. And I have a 1-year+ Spanish streak on Duolingo, and speak no Spanish.

Get in touch!

If you're interested in any of the stuff I talked about, don't hesitate to reach out :-)