ISSN: 2582-9793

Enhancing Commit Message Categorization in Open-Source Repositories Using Structured Taxonomy and Large Language Models

Original Research (Published On: 09-Dec-2024)
DOI: https://dx.doi.org/10.54364/AAIML.2024.44171

Prof. Muna Al-Razgan, Manal AlAqil, Ruba Almuwayshir and Zamzam Alhijji

Adv. Artif. Intell. Mach. Learn., 4 (4):2950-2968

Prof. Muna Al-Razgan : Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, 11345, Saudi Arabia

Manal AlAqil : Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, 11345, Saudi Arabia

Ruba Almuwayshir : Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, 11345, Saudi Arabia

Zamzam Alhijji : Department of Software Engineering, College of Computer and Information Sciences, King Saud University, Riyadh, 11345, Saudi Arabia



Article History: Received on: 28-Sep-24, Accepted on: 17-Nov-24, Published on: 09-Dec-24

Corresponding Author: Prof. Muna Al-Razgan

Email: malrazgan@ksu.edu.sa

Citation: Muna Al-Razgan, Manal AlAqil, Ruba Almuwayshir, Zamzam Alhijji. (SAUDI ARABIA) (2024). Enhancing Commit Message Categorization in Open-Source Repositories Using Structured Taxonomy and Large Language Models. Adv. Artif. Intell. Mach. Learn., 4 (4):2950-2968


Abstract

    

Version control systems (VCS) manage source code changes by storing modifications in a database. A key feature of a VCS is the commit function, which saves the project's current state and summarizes the changes in a commit message (CM). These messages are vital for collaboration, particularly in open-source artificial intelligence (AI) projects on platforms such as GitHub, where contributors work on rapidly evolving codebases. This paper presents an empirical analysis of CMs in open-source AI repositories on GitHub, focusing on their content, the effectiveness of categorization by Large Language Models (LLMs), and the impact of message quality on categorization accuracy. A sample of 384 CMs from 34 repositories was manually categorized to establish a taxonomy. Python was then used for automated keyword extraction, refined with regex patterns. An experiment then assessed the performance of ChatGPT-4 in categorizing CMs, first without guidance and later using our developed taxonomy. Our findings indicate that the quality of CMs varies greatly, which has a clear impact on how efficiently they can be categorized. This study contributes to the field by providing a structured taxonomy of CMs and exploring how tools like ChatGPT-4 can be used to analyze them. The insights from this research are intended to benefit both academic studies and real-world software development, particularly by helping teams better understand and automate the handling of CMs in AI projects.
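The keyword-and-regex step mentioned in the abstract can be illustrated with a minimal Python sketch. The category names and patterns below are hypothetical placeholders chosen for illustration; they are not the taxonomy or the patterns developed in the paper, and the first-match rule is an assumption of this sketch.

```python
import re
from collections import Counter

# Hypothetical categories and patterns, for illustration only;
# they are NOT the taxonomy or regexes developed in the paper.
CATEGORY_PATTERNS = {
    "bug_fix": re.compile(r"\b(fix(es|ed)?|bug|patch)\b", re.IGNORECASE),
    "feature": re.compile(r"\b(add(s|ed)?|implement(s|ed)?|support)\b", re.IGNORECASE),
    "documentation": re.compile(r"\b(docs?|readme|typo)\b", re.IGNORECASE),
    "refactor": re.compile(r"\b(refactor(s|ed)?|clean ?up|rename(s|d)?)\b", re.IGNORECASE),
}


def categorize(message):
    """Return the first category whose pattern matches, else 'other'.

    Categories are tried in dictionary order, so a message matching
    several patterns is assigned to the earliest one (an assumption
    of this sketch, not a rule taken from the paper).
    """
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(message):
            return category
    return "other"


def category_counts(messages):
    """Count how many commit messages fall into each category."""
    return Counter(categorize(m) for m in messages)


if __name__ == "__main__":
    sample = [
        "Fix crash when loading model",
        "Add support for GPU training",
        "Update README",
        "wip",
    ]
    for msg in sample:
        print(f"{categorize(msg):15s} {msg}")
```

Terse messages such as "wip" fall through to "other" in this sketch, which is one simple way that low message quality can show up as a categorization failure.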
