Microsoft is expanding multilingual LLM support through a new partnership across Europe. Europe's 24 official and 250 indigenous languages are currently underrepresented in the web content used for LLM training. To address this, Microsoft will collaborate with Hugging Face to provide multilingual data from GitHub. A call for applications will be issued to support content development in 10 underrepresented languages. Initiatives will include recording audio, creating digitized content, and supporting research. The resulting data will be publicly accessible to European citizens, with the aim of improving AI capabilities in these languages.
We have learnt that, basically, one needs to record several hundred hours of people speaking a particular language in order to support the multi-modal capability of AI.
It's important to underscore that all of this work is designed to donate more data so that others can use it.