About

I'm a second-year PhD student at the Center for Language and Speech Processing at Johns Hopkins University, advised by Professor David Yarowsky. In the past, I worked at ALMAnaCH, INRIA in Paris with Benoît Sagot and Rachel Bawden. Even before that, I graduated from the EMLCT Master's program as an Erasmus scholar, with a dual MSc in Computational Linguistics at Charles University, Prague (first year), and Language Science and Technology at Saarland University, Germany (second year). I'm interested in building NLP tools for text and speech that are available for all the world's languages in their dialectal, colloquial, and code-switched variants :)

Publications

See ACL Anthology or Google Scholar.

News


CV

Here's a PDF version of all of this stuff.

Research Interests

I'm interested in mid- to low-resource settings and domain generalization for multilingual and dialectal natural language understanding, generation, machine translation, automatic speech recognition (ASR), and language identification (LID). I'm also interested in benchmarking core linguistic capabilities of large language models in a robust and interpretable manner. Here are some of the kinds of problems in this space.

Cross-lingual transfer...: Many dialects and languages of the world exist across a continuum, with varying degrees of resourcedness at different points. Can we model this continuum in a way that helps NLP tools "fill in the gaps"? How can we best exploit its structure so that good datasets and models at well-resourced points on the continuum benefit the others?

...and the limits of cross-lingual transfer: Much of linguistic divergence is systematic, which means models can extrapolate to low-resource languages that differ from nearby high-resource languages in regular ways. However, languages also have irregular, language-specific phenomena. Can we theoretically quantify the limits of cross-lingual transfer for a given language family? Can we identify the kinds of phenomena that cannot be transferred, so that we can develop targeted solutions for them?

Tool use for multilinguality: LLMs are still often bad at comprehension and generation for several mid- to low-resource languages. For these languages, we may want to leverage the reasoning capabilities of LLMs but outsource linguistic processing and generation to specialized modules. What are the best architectural and training strategies for these modules and their integration with the LLM pipeline?
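
To make the idea concrete, here is a minimal sketch of one such pipeline: a specialized MT module handles comprehension and generation for the low-resource language, while the LLM only reasons in a high-resource pivot language. Everything here (function names, the "glk" language code, the stub bodies) is illustrative, not a real system.

```python
# Minimal sketch: outsource comprehension/generation to a specialized MT module,
# keep reasoning with the LLM. All function bodies are hypothetical stubs.

def translate(text: str, src: str, tgt: str) -> str:
    """Hypothetical specialized MT module (e.g. a small supervised model)."""
    return f"[{src}->{tgt}] {text}"  # stub: a real module would translate

def llm_answer(prompt: str) -> str:
    """Hypothetical LLM call, used only for reasoning in the pivot language."""
    return f"LLM answer to: {prompt}"  # stub: a real call would query an LLM

def answer_in_low_resource_language(question: str, lang: str, pivot: str = "eng") -> str:
    pivot_question = translate(question, src=lang, tgt=pivot)  # comprehension outsourced
    pivot_answer = llm_answer(pivot_question)                  # reasoning stays with the LLM
    return translate(pivot_answer, src=pivot, tgt=lang)        # generation outsourced

if __name__ == "__main__":
    print(answer_in_low_resource_language("Example question", lang="glk"))
```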

ASR and speech representations: Speech tokenization matters! Do hidden-unit representations need to be phonologically sensible to generalize multilingually? I'm also interested in domain generalization for ASR (example scenario: we have a good general-purpose ASR model and want to use it for a meeting about something niche, like bioluminescence in *Photinus pyralis*. We also have a lexicon of words we expect to hear. How can we best feed such priors into the model at test time?)
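
As a concrete illustration of that lexicon-prior scenario, here is a minimal sketch of one simple approach: rescoring an ASR N-best list with a bonus for expected domain terms, a crude stand-in for shallow fusion or contextual biasing. The hypotheses, scores, and bonus value below are made up for illustration.

```python
# Minimal sketch: inject a test-time lexicon prior by re-ranking N-best hypotheses.
# Lexicon contents, bonus value, and N-best scores are illustrative assumptions.

LEXICON = {"bioluminescence", "photinus", "pyralis", "luciferin"}
TERM_BONUS = 2.0  # hypothetical log-score boost per matched lexicon term

def rescore(nbest: list[tuple[str, float]]) -> list[tuple[str, float]]:
    """Re-rank (hypothesis, log_score) pairs using the domain lexicon."""
    rescored = []
    for hyp, score in nbest:
        matches = sum(tok in LEXICON for tok in hyp.lower().split())
        rescored.append((hyp, score + TERM_BONUS * matches))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    nbest = [
        ("bio luminescence in fire flies", -4.1),       # generic model's top guess
        ("bioluminescence in photinus pyralis", -4.6),  # domain-correct hypothesis
    ]
    print(rescore(nbest)[0][0])  # the lexicon prior promotes the domain-correct hypothesis
```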

For Fun...

When I'm not working, I enjoy playing tennis, solving cryptic crosswords, reading about politics/history, writing, salsa and bachata dancing, and learning languages! I also enjoy the occasional game of rapid chess, and am a fan of the Sicilian Dragon.

*And* I have a lifelong desire to sing a cappella but have never actually tried it.

Get in touch!

If you're interested in any of the stuff I talked about, don't hesitate to reach out :-)