I'm interested in making LLMs better at multilinguality. There are several dimensions to this problem:
- 01 Evaluating or improving: We could focus on evaluating existing capabilities, or on improving them.
- 02 Comprehension or generation: Understanding multilingual inputs and generating multilingual outputs are different goals, often needing different kinds of approaches.
- 03 Resourcedness of language: Some languages are practically unseen by models, with little or no data available; others are low-resource, with very little data; and then there are mid-resource languages with decent existing LLM support. Many low-resource languages are also dialectally related to high-resource languages, a relationship that can be exploited in several ways.
- 04 Techniques: Finally, there are a number of ways to tackle these problems.
Here's a schema of my work categorised along the above dimensions!