ISSN: 2582-9793

Efficient Use of Data for Prediction and Validation

Original Research (Published on: 30-Jan-2024)
DOI: https://dx.doi.org/10.54364/AAIML.2024.41106

Sin-Ho Jung

Adv. Artif. Intell. Mach. Learn., 4 (1):1834-1846

Sin-Ho Jung : Professor of Biostatistics & Bioinformatics, Biostatistics & Bioinformatics, Basic Science Departments

Article History: Received on: 12-May-23, Accepted on: 23-Jan-24, Published on: 30-Jan-24

Corresponding Author: Sin-Ho Jung

Email: sinho.jung@duke.edu

Citation: Lu Liu and Sin-Ho Jung (2024). Efficient Use of Data for Prediction and Validation. Adv. Artif. Intell. Mach. Learn., 4 (1):1834-1846

Abstract

Building a prediction model is one of the most important tasks in the analysis of high-dimensional data. A fitted prediction model should be validated before future use, so such an analysis must use the available data for both training and validation. With a hold-out method, the fitted prediction model is more efficient when the training set is larger, but validation power is lower because the validation set is correspondingly smaller. To balance the efficiency of the fitted prediction model against its validation, a 50-50 allocation of the whole data set is a popular hold-out choice. In a prediction and validation procedure, the information embedded in the whole data set should be used as efficiently as possible. As one such effort, cross-validation (CV) methods have become very popular. In a CV method, a large portion of the data set is used for training and the remaining small portion is used for validation, and this procedure is repeated until every data point has been used for validation. Because each data point is used for both training and validation, increasing the training portion increases training efficiency but decreases validation power owing to increased over-fitting, i.e., more frequent use of each data point for training. As another effort toward efficient use of the whole data, we propose using the whole data set for both training and validation, which we call the 1-fold CV method. Fitting the prediction model to the whole data maximizes training efficiency, but reusing the whole data set for validation is expected to yield very low validation power. We estimate the validation power of CV methods using permutation methods. Through extensive simulation studies and real data analysis, we find that the newly proposed 1-fold CV method uses the whole data very efficiently.
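
The procedure described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the nearest-centroid classifier, the use of accuracy as the validation statistic, the simulated data, and every function name (`fit_centroids`, `cv_accuracy`, `permutation_pvalue`, `simulate`) are assumptions chosen only to keep the example self-contained. Setting `k = 1` trains and validates on the whole data set, which is how the abstract describes the 1-fold CV method; the permutation test reshuffles the outcome labels and repeats the same CV procedure to obtain a null distribution.

```python
import random


def fit_centroids(X, y):
    """Fit a nearest-centroid classifier (assumed model): one mean vector per class."""
    cents = {}
    for label in (0, 1):
        rows = [x for x, t in zip(X, y) if t == label]
        if rows:  # skip a class that is absent from the training set
            cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents


def predict(cents, x):
    """Return the label of the centroid closest to x in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(cents, key=lambda label: dist2(cents[label]))


def cv_accuracy(X, y, k):
    """K-fold CV accuracy; k == 1 trains and validates on the whole data set."""
    n = len(y)
    if k == 1:
        folds = [(list(range(n)), list(range(n)))]
    else:
        folds = [([i for i in range(n) if i % k != f],
                  [i for i in range(n) if i % k == f]) for f in range(k)]
    correct = 0
    for train, val in folds:
        cents = fit_centroids([X[i] for i in train], [y[i] for i in train])
        correct += sum(predict(cents, X[i]) == y[i] for i in val)
    return correct / n


def permutation_pvalue(X, y, k, n_perm, rng):
    """Permutation p-value: rerun the same CV procedure on label-shuffled data."""
    observed = cv_accuracy(X, y, k)
    exceed = 0
    for _ in range(n_perm):
        y_perm = y[:]
        rng.shuffle(y_perm)
        if cv_accuracy(X, y_perm, k) >= observed:
            exceed += 1
    return observed, (exceed + 1) / (n_perm + 1)


def simulate(n, p, effect, rng):
    """Hypothetical data: p Gaussian features, only the first shifted by `effect`."""
    y = [0] * (n // 2) + [1] * (n - n // 2)
    rng.shuffle(y)
    X = [[rng.gauss(effect * t if j == 0 else 0.0, 1.0) for j in range(p)]
         for t in y]
    return X, y


if __name__ == "__main__":
    rng = random.Random(2024)
    X, y = simulate(n=40, p=5, effect=1.0, rng=rng)
    for k in (1, 2, 5, 10):
        acc, pval = permutation_pvalue(X, y, k, n_perm=200, rng=rng)
        print(f"k={k:2d}  accuracy={acc:.3f}  permutation p-value={pval:.3f}")
```

Because the permutation step reruns the full training and validation procedure on each shuffled data set, the optimism of the 1-fold accuracy appears in the null distribution as well as in the observed statistic. Under these assumptions, validation power for a given `k` could be estimated as the fraction of simulated data sets whose permutation p-value falls below a chosen significance level.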
