ISSN: 2582-9793

Bi-Contextual Retrieval Augmented Generation (RAG) for Automatic Descriptive Answer Grading

Original Research (Published on: 26-May-2026)
DOI: https://doi.org/10.54364/AAIML.2026.63307

Malik Muhammad Saad Missen, Asma Amjad, Muzamil Malik and Hassan Taimour Khan

Adv. Artif. Intell. Mach. Learn., XX (XX):-

1. Malik Muhammad Saad Missen: The Islamia University of Bahawalpur, Pakistan

2. Asma Amjad: Dept of Information Technology, The Islamia University of Bahawalpur

3. Muzamil Malik: Dept of Computer Science, Hamdard University Islamabad

4. Hassan Taimour Khan: Dept of Technology Management, The Islamia University of Bahawalpur

Article History: Received on: 26-Feb-26, Accepted on: 19-May-26, Published on: 26-May-26

Corresponding Author: Malik Muhammad Saad Missen

Email: saad.missen@gmail.com

Citation: Asma Amjad, et al. Bi-Contextual Retrieval Augmented Generation (RAG) for Automatic Descriptive Answer Grading. Advances in Artificial Intelligence and Machine Learning. 2026. (Ahead of Print) https://dx.doi.org/10.54364/AAIML.2026.63307


Abstract

Automatic Short Answer Grading (ASAG) is a well-established task in natural language processing (NLP). Its goal is to grade students' descriptive answers automatically while remaining consistent with the judgments of human graders. Recent Large Language Models (LLMs) have substantially improved automated grading; however, their generalizability and accuracy remain limited because they lack dataset-specific grounding. We present EDURAG, a Retrieval-Augmented Generation (RAG) based model that contextualizes LLM-based grading in two ways: with grading exemplars and with additional knowledge generated by a Question Focused Knowledge Extraction (QFKE) module. The proposed QFKE module thus supplies a second layer of context for EDURAG. In contrast to conventional supervised methods, EDURAG requires no model fine-tuning. The proposed framework is evaluated on the ASAG2024 benchmark, which consolidates seven short-answer grading datasets spanning various domains, educational levels, and grading scales. Following the benchmark protocol, performance is measured with weighted Root Mean Square Error (wRMSE). The experimental results show that the dual context provided by EDURAG significantly improves grading accuracy compared with vanilla LLM grading, and an ablation study confirms the importance of this dual context. Although the experiments yield promising results (an improvement of almost 15%), a gap to human grading performance remains, which points to the potential of hybrid human-AI grading systems. Overall, the results indicate that retrieval-enhanced LLMs offer a scalable and generalizable direction for automated assessment.
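
For readers unfamiliar with the evaluation metric, a weighted RMSE can be sketched as below. This is an illustration only: the abstract does not state how the ASAG2024 benchmark assigns its weights, so the per-answer `weights` argument, the function name `weighted_rmse`, and the assumption that grades have already been normalized to a common scale (for example 0 to 1) are all hypothetical choices made for this sketch.

```python
import math
from typing import Sequence


def weighted_rmse(
    human: Sequence[float],
    predicted: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Weighted root mean square error between human and predicted grades.

    Assumption: grades are on a common normalized scale, and `weights`
    holds one non-negative weight per answer. How the benchmark derives
    these weights is not specified in the abstract.
    """
    if not (len(human) == len(predicted) == len(weights)):
        raise ValueError("human, predicted, and weights must have equal length")
    total_weight = sum(weights)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    # Weighted mean of squared errors, then the square root.
    weighted_sq_error = sum(
        w * (h - p) ** 2 for h, p, w in zip(human, predicted, weights)
    )
    return math.sqrt(weighted_sq_error / total_weight)
```

With uniform weights this reduces to the ordinary RMSE; a lower value means the predicted grades sit closer to the human grades.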