Sin-Ho Jung
Adv. Artif. Intell. Mach. Learn., 4 (1):1834-1846
Sin-Ho Jung: Professor of Biostatistics & Bioinformatics, Department of Biostatistics & Bioinformatics, Basic Science Departments
DOI: https://dx.doi.org/10.54364/AAIML.2024.41106
Article History: Received on: 12-May-23, Accepted on: 23-Jan-24, Published on: 30-Jan-24
Corresponding Author: Sin-Ho Jung
Email: sinho.jung@duke.edu
Citation: Lu Liu and Sin-Ho Jung (2024). Efficient Use of Data for Prediction and Validation. Adv. Artif. Intell. Mach. Learn., 4 (1):1834-1846
Prediction model building is one of the most important tasks in the analysis of high-dimensional data, and a fitted prediction model should be validated for future use. Hence, in such an analysis, the whole data set must serve both training and validation. Under a hold-out method, a larger training set yields a more efficient fitted prediction model, but the smaller validation set that remains lowers the validation power. To balance the efficiency of the fitted prediction model against its validation, a 50-50 allocation of the whole data set is popularly used as a hold-out method. In the prediction and validation procedure, we have to use the information embedded in the whole data set as efficiently as possible. As one such effort, cross-validation (CV) methods have become very popular. In a CV method, a large portion of the data set is used for training and the remaining small portion for validation, and this procedure is repeated until every data point has been used for validation. Because each data point serves both training and validation, increasing the portion allotted to training raises training efficiency but lowers validation power through increased over-fitting, i.e., more frequent use of each data point for training. As another effort toward efficient use of the whole data, we propose to use the whole data set for both training and validation, called the 1-fold CV method. Fitting the prediction model on the whole data maximizes training efficiency, but reusing the whole data set for validation is expected to make the validation power very low. The validation power of CV methods will be estimated by permutation methods. Through extensive simulation studies and real data analysis, we find that the newly proposed 1-fold CV method uses the whole data very efficiently.
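The procedures described above can be sketched in code. The following is a minimal illustration, not the authors' implementation: it uses ordinary least squares as a stand-in prediction model, Pearson correlation between predictions and outcomes as the validation statistic, and a toy simulated data set. The function names (`cv_predictions`, `one_fold_predictions`, `permutation_pvalue`) and all tuning values are hypothetical choices for illustration. The key point is that the "1-fold CV" predictor trains and validates on the whole data, and the permutation null reuses the data in exactly the same way, so the optimism of reuse is built into the reference distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

def cv_predictions(X, y, k=5):
    """K-fold cross-validated predictions: each point is predicted by an
    ordinary-least-squares model trained on the other k-1 folds."""
    n = len(y)
    folds = np.array_split(rng.permutation(n), k)
    yhat = np.empty(n)
    for test_idx in folds:
        train_idx = np.setdiff1d(np.arange(n), test_idx)
        beta, *_ = np.linalg.lstsq(X[train_idx], y[train_idx], rcond=None)
        yhat[test_idx] = X[test_idx] @ beta
    return yhat

def one_fold_predictions(X, y):
    """'1-fold CV': the whole data set fits the model and is then reused
    for validation; its optimism is absorbed by the permutation null,
    which reuses the data in exactly the same way."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def permutation_pvalue(predict, X, y, n_perm=99):
    """Validation power via permutation: compare the observed
    prediction-outcome correlation with its distribution under
    random permutations of the outcome vector y."""
    obs = np.corrcoef(predict(X, y), y)[0, 1]
    exceed = 0
    for _ in range(n_perm):
        yp = rng.permutation(y)
        if np.corrcoef(predict(X, yp), yp)[0, 1] >= obs:
            exceed += 1
    return (exceed + 1) / (n_perm + 1)

# Toy data with a real signal: y depends linearly on the first predictor.
n, p = 60, 5
X = rng.standard_normal((n, p))
y = 2.0 * X[:, 0] + 0.5 * rng.standard_normal(n)

p_cv = permutation_pvalue(cv_predictions, X, y)        # 5-fold CV
p_one = permutation_pvalue(one_fold_predictions, X, y)  # whole data reused
```

With a genuine signal, both p-values are small; under a permuted (null) outcome, the whole-data fit still shows an optimistically high observed correlation, but so do the permutation replicates, which is what keeps the 1-fold procedure calibrated.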