Vyākarana: A Colorless Green Benchmark for Syntactic Evaluation in Indic Languages

Abstract

While there has been significant progress towards developing NLU datasets and benchmarks for Indic languages, syntactic evaluation remains comparatively underexplored. Unlike English, Indic languages have rich morphosyntax, grammatical gender, free word order, and highly inflectional morphology. In this paper, we introduce Vyākarana, a benchmark dataset of Colorless Green sentences in Indic languages for the syntactic evaluation of multilingual language models. We use the dataset to probe four multilingual language models (mBERT, DistilmBERT, XLM-R, and IndicBERT) for syntactic knowledge of Indic languages. We report layer-wise probing results for four syntax-related tasks: PoS Tagging, Syntax Tree-depth Prediction, Grammatical Case Marking, and Subject-Verb Agreement. Our results show that the language models trained exclusively on Indic languages do not capture syntax as effectively as the other highly multilingual language models.
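
The layer-wise probing setup mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the per-layer representations are synthetic stand-ins for model hidden states, the nearest-centroid classifier is a simplified substitute for a trained probing classifier, and every name here (`make_layer_data`, `layerwise_probe`, the `strengths` parameter) is hypothetical. The only idea it aims to show is that a separate probe is fit on each layer's representations and the per-layer accuracies are then compared.

```python
import random


def make_layer_data(rng, n, dim, strength):
    """Build synthetic (vector, label) pairs standing in for one layer's
    token representations. `strength` is an assumed knob controlling how
    recoverable the binary label is from the vectors."""
    data = []
    for _ in range(n):
        label = rng.randrange(2)
        sign = 1.0 if label == 1 else -1.0
        vec = [sign * strength + rng.gauss(0.0, 1.0) for _ in range(dim)]
        data.append((vec, label))
    return data


def fit_centroids(train):
    """Compute the mean vector of each label (the 'probe' parameters)."""
    sums, counts = {}, {}
    for vec, label in train:
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {lab: [s / counts[lab] for s in sums[lab]] for lab in sums}


def predict(centroids, vec):
    """Assign the label whose centroid is closest in squared distance."""
    def sqdist(center):
        return sum((a - b) ** 2 for a, b in zip(center, vec))
    return min(centroids, key=lambda lab: sqdist(centroids[lab]))


def probe_accuracy(train, test):
    """Fit the probe on `train` and return its accuracy on `test`."""
    centroids = fit_centroids(train)
    correct = sum(predict(centroids, vec) == label for vec, label in test)
    return correct / len(test)


def layerwise_probe(strengths, n_train=200, n_test=100, dim=4, seed=0):
    """Fit one independent probe per layer and return per-layer accuracies.
    Each entry of `strengths` simulates one layer."""
    rng = random.Random(seed)
    accuracies = []
    for strength in strengths:
        train = make_layer_data(rng, n_train, dim, strength)
        test = make_layer_data(rng, n_test, dim, strength)
        accuracies.append(probe_accuracy(train, test))
    return accuracies


if __name__ == "__main__":
    for layer, acc in enumerate(layerwise_probe([0.0, 0.5, 1.5, 3.0])):
        print(f"layer {layer}: probe accuracy = {acc:.2f}")
```

In an actual experiment the synthetic vectors would be replaced by hidden states extracted from each layer of the model under study, and the labels by task annotations such as PoS tags or case markers.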

Publication
arXiv