About

I'm a second-year PhD student at the Center for Language and Speech Processing at Johns Hopkins University, advised by Professor David Yarowsky. Before this, I worked at ALMAnaCH, INRIA in Paris, for a year, investigating the behaviour of Transformer-based models on closely related dialects and languages. Before that, I graduated from the EMLCT Master's program as an Erasmus scholar, with a dual MSc. in Computational Linguistics at Charles University, Prague (first year), and Language Science and Technology at Saarland University, Germany (second year). I'm interested in building NLP tools for text and speech that are available for all the world's languages in their dialectal, colloquial, and code-switched variants :)

Publications

See ACL Anthology or Google Scholar

News

  • Sept 2024 New paper accepted to EMNLP '24: Evaluating Large Language Models along Dimensions of Language Variation: A Systematic Investigation of Cross-lingual Generalization. See you in Miami :-)

  • May 2024 Our work on Unsupervised Bilingual Lexicon Induction for Related Data-Imbalanced Languages won the COLING Best Student Paper Award at LREC-COLING!

  • Feb 2024 I'll be interning at Seamless Meta in Menlo Park over the summer; excited!

  • June 2023 Presented my work, called Cross-Lingual Strategies for Low-Resource Language Modeling: A Study on Five Indic Dialects, at TALN 2023, Paris, France. It's about comparing basic multilingual strategies for language modelling, for truly low-resource languages that belong to the same dialect continuum or language family.

  • April 2023 Accepted my PhD offer at JHU CLSP, and will be starting in Fall 2023! I'll be advised by Professor David Yarowsky.

  • Oct 2022 Invited talk at Linguistic Mondays, Institute of Formal and Applied Linguistics, Charles University, on experiments from two of my recent works on Indic languages: Empirical Models for an Indic Dialect Continuum

  • Oct 2022 New paper accepted at CoNLL 2022! This paper was adapted from my M.Sc. thesis; it is about data collection for 26 dialects and languages of the Indic language continuum, along with strategies for cognate induction for these languages as a step towards building bilingual resources for (extremely) low-resource languages.

  • Oct 2022 Starting as a research engineer at ALMAnaCH, INRIA in Paris, with Benoît Sagot and Rachel Bawden; super excited :)

  • Aug 2022 Defended my thesis (twice), graduated from Charles University and Saarland University! I did my thesis jointly with the MLT group at DFKI and UFAL. I was supervised by Prof. Josef van Genabith and Cristina España-Bonet from the former and Zdeněk Žabokrtský from the latter. The thesis is about cognate induction and data collection for 26 (extremely) low-resource languages of the Indic dialect continuum; check it out here: Empirical Models for an Indic Language Continuum!

  • Jul 2022 New paper at SIGMORPHON@NAACL '22, about subword-level embedding transfer from Hindi to Marathi and Nepali.

  • Apr 2022 New paper at LREC '22 (the UniSegments project) with UFAL, harmonizing different morphological resources for 17 languages. I worked on Hindi, Marathi, Malayalam, Tamil, and Bengali.

CV

Here's a PDF version of all of this stuff.

Research Interests

I'm interested in low-resource settings and domain generalization for multilingual and dialectal NLU, machine translation, ASR, and LID, as well as linguistic interpretability of multilingual speech and text language models. Here are some of the kinds of problems that interest me.

Cross-lingual transfer: Many dialects and languages of the world exist across a continuum, with varying degrees of resourcedness at various points. Can we model this continuum in a manner that can help NLP tools "fill in the gaps"? How can we most effectively use its properties to leverage good datasets and models at certain points on this continuum for others?

Properties of data: If we can only create bitext for a fixed number of sentences in a low-resource language, what should the morphosyntactic and semantic properties of those sentences be for the best chance at generalisation? How much value does some new set of monolingual sentences add to existing corpora for a language? (And the speech version of this problem!)

ASR, speech representations: Speech tokenization matters! Do hidden unit representations need to be phonologically sensible to generalize multilingually? I'm also interested in domain generalization for ASR (example scenario: we have good general ASR, and we want to use it for a meeting about something niche, like, bioluminescence in Photinus pyralis. We have a lexicon containing words we might expect. How can we best feed such priors into the model at test time?)

NMT with help: In extremely low-resource settings, where we lack large bitext, it may be easier to create resources such as bilingual lexicons and tools like morphological reinflectors. How can these be integrated into current Transformer-based NMT paradigms?

For Fun...

When I'm not working, I enjoy playing tennis, solving cryptic crosswords, reading about politics/history, writing, salsa and bachata dancing, and learning languages! I also enjoy the occasional game of rapid chess, and am a fan of the Sicilian Dragon.

*And* I have a lifelong desire to sing a cappella but have never actually tried it.

Get in touch!

If you're interested in any of the stuff I talked about, don't hesitate to reach out :-)