About

I'm a third-year PhD student at the Center for Language and Speech Processing at Johns Hopkins University, advised by Professor David Yarowsky. In the past, I worked at ALMAnaCH, INRIA in Paris as a Research Engineer with Benoît Sagot and Rachel Bawden. Even before that, I graduated from the EMLCT Master's program as an Erasmus scholar, with a dual MSc in Computational Linguistics and NLP at Charles University, Prague (first year) and Saarland University, Saarbrücken (second year). I'm interested in building NLP tools for text and speech that are available for all the world's languages in their dialectal, colloquial, and code-switched variants :)

Publications

See ACL Anthology or Google Scholar.

News


CV

Here's a PDF version of all of this stuff.

Research Interests

There are 3800+ written languages in the world, with varying levels of resourcedness. Given the LLM paradigm that powers everything these days, making NLP massively multilingual has two broad facets: enabling LLMs to comprehend content and instructions in a low-resource language (LRL), and teaching them to generate accurate, useful, and fluent content in that language. Here are some of the kinds of problems in this space:

Answering the (age-old) cascade question: At the frontier of NLP today, we're able to do quasi-magical things in a few high-resource languages, especially English. For the rest of the languages in the world, our tools lag behind: they are more likely to produce wrong-language text, and their responses are less accurate. Given complex problems, they are less likely to make it through a series of logical steps correctly. They are less easily controlled for toxicity and safety. What we do have for these languages is good quality machine translation (well, better quality). Instead of attempting to induce native capabilities for all the above in LLMs for the range of mid-resource languages - why not translate inputs into English, let LLMs do what they know best, and then translate English outputs back into the target language? What do we lose by this cascaded approach, what do we gain, and can we quantify these gains and losses? What does it mean for the kinds of resources we try to collect in LRLs, and for the tools we try to build for them? Should we redirect our energy towards building MT specialised in LLM outputs, instead of training LLMs in various languages?
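
To make the question concrete, here's a minimal sketch of the cascaded setup (sometimes called translate-test). Both translate and llm_generate are hypothetical placeholders for whatever MT system and LLM you actually plug in - only the wiring is the point.

```python
# Sketch of the cascaded ("translate-test") pipeline described above.
# `translate` and `llm_generate` stand in for whatever MT system and LLM
# you actually have; only the wiring is the point.

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical placeholder for an MT system."""
    raise NotImplementedError

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for a call to an English-centric LLM."""
    raise NotImplementedError

def cascaded_answer(query: str, lang: str) -> str:
    query_en = translate(query, src=lang, tgt="eng")   # LRL -> English
    answer_en = llm_generate(query_en)                 # let the LLM work in English
    return translate(answer_en, src="eng", tgt=lang)   # English -> LRL

def native_answer(query: str) -> str:
    # The "native" baseline the cascade gets compared against:
    # prompt the LLM directly in the low-resource language.
    return llm_generate(query)
```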

Reasoning and inference-time scaling in a multilingual context: Once our LLM is pretrained, we have a number of ways of making it better at complex tasks. A desired answer may not be straightforwardly derivable from the question for multiple reasons: the task itself might be difficult to understand, it may require a few logical hops, it may require a non-trivial specialised computation, or it may require consideration of several perspectives. Strategies such as few-shot prompting with in-context learning, chain-of-thought elicitation (and its multiple variants), tool use, and multi-agent debate are methods of searching a solution space for a desired answer in response to the above problems. However, they are all worse in multilingual settings. (This by itself is somewhat interesting: if LLMs are doing all their reasoning in English, as some papers indicate they are, this shouldn't be the case.) I'm interested in understanding where this degradation comes from - divergent base-model processes, shallower fluency and articulation problems, lack of post-training exposure, or something else - and in developing solutions to mitigate it.
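
As a toy illustration of how one might start poking at this, here's a sketch that contrasts eliciting the chain of thought in the target language with reasoning in English and only answering in the target language. llm_generate and the prompts are purely hypothetical, not a real experimental setup.

```python
# Toy probe for the degradation question: elicit the chain of thought in the
# target language vs. reason in English and answer in the target language.
# `llm_generate` is a hypothetical stand-in; the prompts are illustrative only.

def llm_generate(prompt: str) -> str:
    """Hypothetical placeholder for an LLM call."""
    raise NotImplementedError

def cot_in_target_language(question: str, lang: str) -> str:
    prompt = (
        f"Answer the following question in {lang}. "
        "Think step by step before giving your final answer.\n\n"
        f"{question}"
    )
    return llm_generate(prompt)

def cot_in_english(question: str, lang: str) -> str:
    # One way to separate "reasoning" failures from surface fluency and
    # articulation failures: reason in English, answer in the target language.
    prompt = (
        "Think step by step in English, then give only the final answer "
        f"in {lang}.\n\nQuestion: {question}"
    )
    return llm_generate(prompt)
```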

Tool use for multilinguality: LLMs are still often bad at comprehension and generation for several mid- to low-resource languages. For these languages, we may want to leverage the reasoning capabilities of LLMs but outsource linguistic processing and generation to specialized modules. What are the best architectural and training strategies for these modules and their integration with the LLM pipeline?
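
One possible shape for this, just to make the idea concrete: the LLM handles planning and reasoning, and emits structured calls to registered language-specific modules. All the names below (analyze_morphology, TOOLS, run_tool) are hypothetical placeholders, not a real system.

```python
# Hypothetical shape for "reason with the LLM, outsource linguistic work":
# the LLM emits structured tool calls; the pipeline executes them and feeds
# results back into the context. All names below are placeholders.

from typing import Callable

def analyze_morphology(word: str, lang: str) -> str:
    """Placeholder for a dedicated morphological analyser."""
    raise NotImplementedError

def translate(text: str, src: str, tgt: str) -> str:
    """Placeholder for a specialised MT module."""
    raise NotImplementedError

TOOLS: dict[str, Callable[..., str]] = {
    "morph": analyze_morphology,
    "translate": translate,
}

def run_tool(name: str, **kwargs) -> str:
    # e.g. run_tool("morph", word="ghar", lang="hin")
    return TOOLS[name](**kwargs)
```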

Better machine translation: Yep, MT is not solved. MT is a moving target even for mid-to-high resource languages, because our horizons are constantly expanding with what we expect from NLP. Today we want to do fine-grained arguments, long coherent documents, literary generation, abstract correction. Even more audaciously: we want to do MT for *all* languages - we want to take lexicons and grammar books and show them to LLMs, and ask LLMs to translate Kalamang for us. How do we evolve MT for mid-resource languages to adapt it to new horizons? How do we teach LLMs to translate into and from unseen languages? How do we make all this robust and reliable? Further, MT evaluation very quickly becomes a problem: how do you holistically evaluate MT for a language you can only do string matching in? I'm interested in exploring MT catered to the frontier of NLP, for languages of various resource levels.
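
For a concrete sense of what "string matching" means here: surface-overlap metrics like chrF and BLEU are often the only thing we can compute for a language we don't speak and have no learned metric for. A minimal sketch, assuming sacrebleu is installed; the example data is made up.

```python
# "String matching" evaluation, concretely: surface-overlap metrics computed
# against gold translations. Assumes sacrebleu; the example data is made up.

import sacrebleu

hypotheses = ["Dies ist ein Test .", "Noch ein Satz ."]   # MT outputs
references = ["Dies ist der Test .", "Noch ein Satz ."]   # gold translations

chrf = sacrebleu.corpus_chrf(hypotheses, [references])
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"chrF = {chrf.score:.1f}, BLEU = {bleu.score:.1f}")

# Neither number tells us whether the output is fluent, faithful, or safe,
# which is exactly the holistic-evaluation gap described above.
```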

(And of course) Dialectal generalisation, cross-lingual transfer and its limits: Many dialects and languages of the world exist across a continuum, with varying degrees of resourcedness at various points. Can we model this continuum in a manner that can help NLP tools fill in the gaps? How can we most effectively use its properties to leverage good datasets and models at certain points on this continuum for others? Many things about linguistic divergence are systematic, meaning that models can extrapolate performance to low-resource languages that are close enough to high-resource languages in regular ways. However, languages also have irregular, language-specific phenomena. Can we theoretically quantify the limits of cross-lingual transfer for a given language family? Can we identify the kinds of phenomena that cannot be transferred, so that we can evolve targeted solutions for them?
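
As a purely illustrative toy, here's one crude proxy for "close in regular ways": character n-gram overlap between corpora of a high-resource language and a related low-resource one. Real transfer studies use much richer signals (typological features, subword overlap, phonological distance); this just makes the continuum idea concrete, and the usage example in the comment is hypothetical.

```python
# Crude, illustrative proxy for relatedness on the dialect continuum:
# character n-gram overlap between two corpora. Not a serious transfer
# predictor, just a way to make the "continuum" idea concrete.

from collections import Counter

def char_ngrams(text: str, n: int = 3) -> Counter:
    text = text.lower()
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def ngram_overlap(corpus_a: str, corpus_b: str, n: int = 3) -> float:
    a, b = char_ngrams(corpus_a, n), char_ngrams(corpus_b, n)
    shared = sum((a & b).values())
    union = sum((a | b).values())
    return shared / union if union else 0.0

# Hypothetical usage: compare ngram_overlap(hindi, bhojpuri) with
# ngram_overlap(hindi, tamil) and see whether higher overlap tracks
# better zero-shot transfer.
```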

I also have a particular interest in NLP for Indian languages! Also - besides all this stuff, people seem to be talking a fair bit about agentic stuff these days. I should figure out what's up with that.

For Fun...

When I'm not working, I enjoy playing tennis, dancing WCS, salsa, and bachata, solving cryptic crosswords, reading, and writing. I'm currently learning Telugu, and trying not to forget my French. Putting that here for accountability. I also enjoy the occasional game of rapid chess, and am a fan of the Sicilian Dragon.

Other fun things about me: I played basketball for my city as a teenager, my FIDE classical chess rating is 1609, and I spent the summer of 2019 translating a book from Hindi to English. And I have a 1-year+ Spanish streak on Duolingo, and speak no Spanish.

Get in touch!

If you're interested in any of the stuff I talked about, don't hesitate to reach out :-)