With nearly 1.4 billion people, India is the second most populous country in the world. Yet Indian languages, like Hindi and Tamil, are underrepresented on the web. Popular Natural Language Understanding (NLU) models perform worse on Indian languages than on English, which leads to subpar experiences in downstream web applications for Indian users. With more attention from the Kaggle community and your novel machine learning solutions, we can help Indian users make the most of the web.
Predicting answers to questions is a common NLU task, but far less so for Hindi and Tamil. Progress on multilingual modeling requires a concentrated effort on both high-quality datasets and modeling improvements. Additionally, for languages that are typically underrepresented in public datasets, it can be difficult to build trustworthy evaluations. We hope the dataset provided for this competition, along with additional datasets generated by participants, will enable future machine learning for Indian languages.
In this competition, your goal is to predict answers to real questions about Wikipedia articles. You will use chaii-1, a new question answering dataset with question-answer pairs. The dataset covers Hindi and Tamil, collected without the use of translation. It provides a realistic information-seeking task with questions written by native-speaking expert data annotators. You will be provided with a baseline model and inference code to build upon.
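To make the task concrete, here is a minimal Python sketch of what a single question-answer pair might look like and how a predicted answer could be compared with a reference. Everything in it is an assumption made for illustration: the `QAExample` class, its field names, the sample sentence, and the helper functions are hypothetical and are not taken from chaii-1 or the provided baseline. The word-overlap score is one simple choice for comparing answers, not necessarily the competition's official metric.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class QAExample:
    """One question-answer pair. Field names are hypothetical, not the dataset schema."""
    context: str      # passage from a Wikipedia article
    question: str     # question written about the passage
    answer_text: str  # reference answer
    language: str     # e.g. "hindi" or "tamil"


def find_answer_span(example: QAExample) -> Optional[Tuple[int, int]]:
    """Return (start, end) character offsets of the first occurrence of the
    answer in the context, or None if the answer does not appear verbatim."""
    start = example.context.find(example.answer_text)
    if start == -1:
        return None
    return start, start + len(example.answer_text)


def word_jaccard(prediction: str, reference: str) -> float:
    """Overlap of whitespace-separated words: |A & B| / |A | B|."""
    a = set(prediction.lower().split())
    b = set(reference.lower().split())
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    return len(a & b) / len(a | b)


# Illustrative example written for this sketch, not drawn from the dataset.
example = QAExample(
    context="भारत की राजधानी नई दिल्ली है।",
    question="भारत की राजधानी क्या है?",
    answer_text="नई दिल्ली",
    language="hindi",
)

span = find_answer_span(example)
if span is not None:
    start, end = span
    print(example.context[start:end])

print(word_jaccard("दिल्ली", example.answer_text))
```

A partial prediction such as "दिल्ली" shares one of the two reference words, so it earns partial credit under this score rather than counting as a complete miss.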
If successful, you’ll improve upon the baseline performance of NLU models in Indian languages. The results could improve the web experience for many of the nearly 1.4 billion people of India. Additionally, you’ll contribute to multilingual NLP, which could be applied beyond the languages in this competition.
Google Research India contributes fundamental advances in computer science and applies its research to big problems impacting India, Google, and communities around the world. The Natural Language Understanding group at Google Research India works specifically with ML to address the unique challenges of the Indian context, such as code mixing in Search and the diversity of languages, dialects, and accents in Assistant, as well as learning from limited resources and advancing multilingual models.
chaii (Challenge in AI for India) is a Google Research India initiative created to spark AI applications that address some of India's pressing problems in new ways. Starting with a focus on NLU, chaii hopes to make progress towards multilingual modeling, as language diversity is significantly underserved on the web. Google Research India is working on transformational approaches to healthcare, agriculture, and education, and on improving apps and services such as Search, Assistant, and payments, for example to deal with challenges arising from the diversity of languages in India. We also acknowledge the support from the AI4Bharat Team at the Indian Institute of Technology Madras.
- 1st Place – USD$2,000
- 2nd Place – USD$2,000
- 3rd Place – USD$2,000
- 4th Place – USD$2,000
- 5th Place – USD$2,000