OpenAI has recently introduced IndQA, a benchmark designed to evaluate how well large language models understand and reason about Indian languages and culture. The benchmark targets a pressing question: how can we effectively assess AI's grasp of the linguistic and cultural nuances that shape everyday life in India?
Why IndQA Matters
Globally, around 80 percent of the population does not speak English as their primary language, yet many existing benchmarks for non-English capabilities rely on simple translation or multiple-choice formats. Established multilingual benchmarks such as MMMLU and MGSM are also saturating: many strong models now achieve similar scores, which makes it hard to measure meaningful progress and says little about how well models handle local context and cultural understanding.
Dataset, Languages, and Domains
IndQA comprises 2,278 questions across 12 languages, specifically tailored to assess cultural and everyday knowledge relevant to India. The languages evaluated include:
- Bengali
- Gujarati
- Hindi
- Hinglish
- Kannada
- Malayalam
- Marathi
- Odia
- Punjabi
- Tamil
- Telugu
The benchmark covers 10 cultural domains:
- Architecture and Design
- Arts and Culture
- Everyday Life
- Food and Cuisine
- History
- Law and Ethics
- Literature and Linguistics
- Media and Entertainment
- Religion and Spirituality
- Sports and Recreation
Each question is accompanied by four components, sketched as a simple data record after this list:
- A culturally grounded prompt in an Indian language
- An English translation for auditability
- Rubric criteria for grading
- An ideal answer that encapsulates expert expectations
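One way to picture an IndQA entry is as a record holding these four components plus its language and domain tags. The field names below are illustrative assumptions, since OpenAI has not published a data schema:

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    """One expert-written grading criterion with its weight."""
    description: str   # what a strong answer should contain
    weight: float      # relative importance assigned by the expert

@dataclass
class IndQAItem:
    """One IndQA question with its four published components.

    Field names are assumptions for illustration, not OpenAI's schema.
    """
    prompt_native: str       # culturally grounded prompt in an Indian language
    prompt_english: str      # English translation kept for auditability
    rubric: list[Criterion]  # weighted criteria used for grading
    ideal_answer: str        # reference answer encoding expert expectations
    language: str            # e.g. "Hindi" or "Hinglish"
    domain: str              # e.g. "Food and Cuisine"
```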
Rubric-Based Evaluation Pipeline
IndQA employs rubric-based grading rather than exact-match accuracy. For each question, domain experts define multiple criteria describing what a strong answer must contain, each with an assigned weight. A grader model checks a response against these criteria, which allows partial credit and captures cultural nuance far better than string matching.
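The arithmetic implied by this description is a weighted partial-credit score. Below is a minimal sketch, assuming each criterion is judged simply met or unmet; the actual grader prompts and thresholds are not public:

```python
def rubric_score(judgments: list[bool], weights: list[float]) -> float:
    """Weighted partial credit: the fraction of rubric weight satisfied.

    judgments[i] records whether a grader model judged criterion i as met.
    This is only the aggregation step implied by the description above.
    """
    total = sum(weights)
    earned = sum(w for met, w in zip(judgments, weights) if met)
    return earned / total if total else 0.0

# Example: three criteria weighted 3, 2, 1; the answer meets the first two,
# so it earns 5 of 6 weight points rather than an all-or-nothing score.
print(rubric_score([True, True, False], [3.0, 2.0, 1.0]))  # ~0.833
```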
Construction Process and Adversarial Filtering
The construction process for the IndQA benchmark followed a four-step pipeline:
- Collaboration with Indian organizations to recruit native-level experts in various domains who authored culturally relevant prompts.
- Application of adversarial filtering, in which draft questions were run against OpenAI's strongest models at the time (GPT-4o, OpenAI o3, GPT-4.5, and later GPT-5). Only questions these models answered poorly were retained, preserving headroom to measure future progress (see the sketch after this list).
- Expert-defined grading criteria created to evaluate each question, which are reused in assessing other models on IndQA.
- Experts crafted ideal answers and English translations, which underwent peer review and iterative revision to ensure quality.
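A hypothetical sketch of the filtering step, assuming "sub-par" means that even the best frontier-model response scores below some cutoff (the exact retention rule and threshold are not public):

```python
FRONTIER_MODELS = ["gpt-4o", "o3", "gpt-4.5", "gpt-5"]  # the models named above
KEEP_THRESHOLD = 0.5  # assumed cutoff; OpenAI has not disclosed the real value

def adversarial_filter(drafts, answer_with, grade):
    """Retain only draft questions that frontier models answer poorly.

    answer_with(model, question) -> model's answer (stand-in for an API call)
    grade(question, answer)      -> rubric score in [0, 1], as sketched earlier
    """
    kept = []
    for question in drafts:
        scores = [grade(question, answer_with(m, question))
                  for m in FRONTIER_MODELS]
        if max(scores) < KEEP_THRESHOLD:  # even the best model falls short
            kept.append(question)
    return kept
```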
Measuring Progress on Indian Languages
OpenAI has used IndQA to evaluate its recent frontier models and to track progress on Indian languages over the past few years. Reported performance has improved substantially on IndQA, but significant headroom remains. Results are stratified by language and domain, allowing comparison with other frontier systems.
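Stratifying per-question scores by language and domain is a straightforward aggregation. A minimal sketch, assuming results arrive as (language, domain, score) triples from the grading pipeline above:

```python
from collections import defaultdict

def stratify(results):
    """Average rubric scores per (language, domain) bucket."""
    buckets = defaultdict(list)
    for language, domain, score in results:
        buckets[(language, domain)].append(score)
    return {key: sum(v) / len(v) for key, v in buckets.items()}

# Toy example with made-up scores, not real IndQA results.
sample = [("Hindi", "History", 0.8),
          ("Hindi", "History", 0.6),
          ("Tamil", "Food and Cuisine", 0.4)]
print(stratify(sample))
# {('Hindi', 'History'): 0.7, ('Tamil', 'Food and Cuisine'): 0.4}
```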
Key Takeaways
- IndQA is a culturally grounded Indic benchmark that focuses on how AI models understand and reason about culturally significant questions in Indian languages.
- The dataset, developed collaboratively with 261 domain experts, covers various aspects of Indian culture and consists of 2,278 well-structured questions across 12 languages.
- Evaluation is rubric-based, allowing nuanced grading that captures cultural correctness rather than simple token overlap.
- The questions have been adversarially filtered to ensure that they present a challenge for even the most advanced AI models.
Conclusion
IndQA represents a significant step toward closing the gaps in existing multilingual benchmarks, particularly for a country as linguistically and culturally diverse as India. By combining expert-authored questions, rubric-based grading, and adversarial filtering, it offers a robust framework for assessing language reasoning capabilities in AI systems.
FAQ
- What is IndQA? IndQA is a benchmark created by OpenAI to evaluate AI’s understanding of Indian languages and cultural nuances.
- How many languages does IndQA cover? IndQA covers 12 Indian languages, including Hindi, Bengali, and Tamil.
- What types of questions are included in IndQA? The benchmark includes 2,278 open-ended, culturally grounded questions spanning 10 cultural domains relevant to India, from food and history to law and media.
- How does IndQA evaluate AI responses? IndQA uses a rubric-based grading system that allows for partial credit and captures cultural nuances.
- Why is IndQA important? It addresses the need for effective assessment of AI models in non-English languages, particularly in culturally rich contexts like India.