Malik Muhammad Saad Missen, Asma Amjad, Muzamil Malik and Hassan Taimour Khan
Adv. Artif. Intell. Mach. Learn., XX (XX):-
1. Malik Muhammad Saad Missen: The Islamia University of Bahawalpur, Pakistan
2. Asma Amjad: Dept of Information Technology, The Islamia University of Bahawalpur
3. Muzamil Malik: Dept of Computer Science, Hamdard University Islamabad
4. Hassan Taimour Khan: Dept of Technology Management, The Islamia University of Bahawalpur
DOI: 10.54364/AAIML.2026.63307
Article History: Received on: 26-Feb-26, Accepted on: 19-May-26, Published on: 26-May-26
Corresponding Author: Malik Muhammad Saad Missen
Email: saad.missen@gmail.com
Citation: Asma Amjad, et al. Bi-Contextual Retrieval Augmented Generation (RAG) for Automatic Descriptive Answer Grading. Advances in Artificial Intelligence and Machine Learning. 2026. (Ahead of Print) https://dx.doi.org/10.54364/AAIML.2026.63307
Automatic Short Answer Grading (ASAG) is a well-known research task in natural
language processing (NLP). Its main purpose is to grade students' descriptive
answers automatically while keeping the grades consistent with the
evaluation of human graders. Recent developments in Large Language Models
(LLMs) have greatly improved automated grading performance; however,
the generalizability and accuracy of these models remain limited because of
the absence of dataset-specific grounding. We present EDURAG, a
Retrieval-Augmented Generation (RAG) based model that improves the
contextualization of LLM-based grading with graded exemplars and extra knowledge
generated by a Question Focused Knowledge Extraction (QFKE) module. The proposed
QFKE module provides an extra layer of context for EDURAG. In contrast
to conventional supervised methods, EDURAG does not need model fine-tuning. The
proposed framework is evaluated on the ASAG2024 benchmark, which consolidates
seven short-answer grading datasets across various domains, educational levels,
and grading scales. The benchmark's performance measure is weighted
Root Mean Square Error (wRMSE). The experimental findings show that the dual
contextuality provided by EDURAG significantly improves grading accuracy
compared with vanilla LLM grading. The ablation study also confirms the significance
of dual contextuality provided by EDURAG. Although promising results (almost 15%
improvement) were achieved in the experiments, a gap remains between
the system's performance and that of human graders, suggesting the
potential of hybrid human-AI grading systems. The results indicate that
retrieval-enhanced LLMs offer a scalable and generalizable direction for
automated assessment.
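
To make the idea of dual contextuality concrete, the following is a minimal sketch of how a grading prompt could combine the two context sources described above: retrieved graded exemplars and question-focused knowledge. It is not the authors' implementation. The names `Exemplar`, `retrieve_exemplars`, and `build_prompt` are hypothetical, the token-overlap (Jaccard) retriever is a stand-in for whatever retriever EDURAG actually uses, the 0 to 1 grade scale is an assumption, and the `knowledge` string is assumed to come from a QFKE-like step that is not shown.

```python
import re
from dataclasses import dataclass


@dataclass
class Exemplar:
    """A previously graded answer (hypothetical structure)."""
    question: str
    answer: str
    grade: float  # assumed to be normalized to [0, 1]


def tokens(text: str) -> set:
    """Lowercased alphanumeric tokens of a text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity; a simple stand-in for a real retriever."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def retrieve_exemplars(question, answer, pool, k=2):
    """Return the k graded exemplars most similar to the query pair."""
    query = f"{question} {answer}"
    ranked = sorted(
        pool,
        key=lambda ex: jaccard(query, f"{ex.question} {ex.answer}"),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(question, answer, exemplars, knowledge):
    """Assemble a grading prompt from both context sources."""
    lines = ["You are grading a short answer on a scale from 0 to 1.", ""]
    # Context 1: question-focused knowledge (assumed output of a QFKE-like step).
    lines += ["Background knowledge:", knowledge, ""]
    # Context 2: retrieved exemplars with their human-assigned grades.
    lines.append("Graded examples:")
    for ex in exemplars:
        lines.append(f"- Q: {ex.question} | A: {ex.answer} | Grade: {ex.grade}")
    lines += ["", f"Question: {question}", f"Student answer: {answer}", "Grade:"]
    return "\n".join(lines)
```

The resulting prompt string would then be sent to an LLM of choice; because both contexts are supplied at inference time, no fine-tuning step appears in the sketch, which mirrors the training-free setup the abstract describes.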