Sarfaraz Ahmed Mohammed and Anca Ralescu
Adv. Artif. Intell. Mach. Learn., 3 (3):1494-1525
Sarfaraz Ahmed Mohammed : College of Engineering and Applied Science
Anca Ralescu : Department of Computer Science, College of Engineering and Applied Science
Article History: Received on: 22-Jul-23, Accepted on: 23-Sep-23, Published on: 30-Sep-23
Corresponding Author: Sarfaraz Ahmed Mohammed
Citation: Sarfaraz Ahmed Mohammed, Senuka Abeysinghe, Anca Ralescu (2023). Feature Selection and Comparative Analysis of Breast Cancer Prediction using Clinical Data and Histopathological Whole Slide Images. Adv. Artif. Intell. Mach. Learn., 3 (3):1494-1525
Breast carcinoma is a common cancer among women, with invasive ductal carcinoma and lobular carcinoma being the two most frequent types. Early detection is critical to prevent the disease from progressing. Diagnostic tests include mammography, ultrasound, MRI, and biopsy. Machine learning algorithms can play a key role in analyzing complex clinical datasets to predict disease outcomes. This study uses machine learning and deep learning techniques to analyze publicly available clinical and medical image data. For clinical data, Principal Component Analysis (PCA) and Particle Swarm Optimization (PSO) are applied to the Wisconsin Diagnostic Breast Cancer (WDBC) dataset for feature selection, and the performance of each approach in distinguishing between benign and malignant tumors is evaluated. The results show that the Random Forest (RF) classifier outperforms the other classification algorithms under both PSO and PCA feature selection, achieving predictive accuracies of 95.7% and 97.2%, respectively. The first part of the paper presents a comprehensive analysis of the two feature selection methods on clinical data to optimize predictive performance. The second part is concerned with image data. Although histopathological Whole Slide Imaging (WSI) has been validated for a variety of pathological applications for over two decades, manual detection of cancerous tumors remains challenging and prone to human error. Given the potential of deep learning models to aid pathologists in detecting cancer subtypes, and the increasing ability of current image analysis techniques to predict underlying genomic data and cancer-causing mutations, the second half of the paper focuses on feature extraction using a deep convolutional neural network (U-Net) trained on WSIs from The Cancer Genome Atlas (TCGA) to accurately classify and extract relevant features.
The focus is on feature extraction, nuclei-based instance segmentation, H&E-stained image extraction, and quantifying intensity information for a given WSI to classify the disease type. A comprehensive analysis of feature selection methods is presented for both clinical and medical image data.
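To make the idea of PSO-based feature selection concrete, the following is a minimal sketch of a binary PSO with a sigmoid transfer function, written in plain Python. It is an illustration under stated assumptions and not necessarily the variant used in this paper: the function names `binary_pso` and `toy_fitness`, the hyperparameters, and the set of "informative" feature indices are all hypothetical. In a real experiment the fitness would be something like the cross-validated accuracy of a classifier trained on the selected WDBC features (the dataset has 30 features).

```python
import math
import random

# Hypothetical stand-in for a real objective: rewards three "informative"
# feature indices and penalizes every selected feature. A real study would
# instead score a mask by classifier accuracy on the selected columns.
INFORMATIVE = {0, 3, 7}

def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(mask)

def binary_pso(fitness, n_features, n_particles=20, n_iters=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Maximize `fitness` over 0/1 feature masks; returns (mask, score)."""
    rng = random.Random(seed)
    # Positions are 0/1 masks; velocities are real-valued.
    pos = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(n_particles)]
    vel = [[rng.uniform(-1.0, 1.0) for _ in range(n_features)]
           for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_score = [fitness(p) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_score[i])
    gbest, gbest_score = pbest[g][:], pbest_score[g]

    for _ in range(n_iters):
        for i in range(n_particles):
            for d in range(n_features):
                r1, r2 = rng.random(), rng.random()
                # Standard velocity update: inertia + cognitive + social terms.
                v = (w * vel[i][d]
                     + c1 * r1 * (pbest[i][d] - pos[i][d])
                     + c2 * r2 * (gbest[d] - pos[i][d]))
                # Clamp so the sigmoid does not saturate.
                vel[i][d] = max(-6.0, min(6.0, v))
                # Sigmoid maps velocity to the probability of selecting d.
                prob = 1.0 / (1.0 + math.exp(-vel[i][d]))
                pos[i][d] = 1 if rng.random() < prob else 0
            score = fitness(pos[i])
            if score > pbest_score[i]:
                pbest[i], pbest_score[i] = pos[i][:], score
                if score > gbest_score:
                    gbest, gbest_score = pos[i][:], score
    return gbest, gbest_score

mask, score = binary_pso(toy_fitness, n_features=30)
selected = [i for i, bit in enumerate(mask) if bit]
print("selected feature indices:", selected)
print("fitness:", round(score, 3))
```

The per-feature penalty in `toy_fitness` plays the role of a parsimony term: without it, the swarm has no incentive to drop uninformative features. The weight 0.1 is an arbitrary choice for this sketch.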