A Discretization-Based Feature Selection Method for Microarray Gene Expression Data Analysis (pp. 317-328)
Authors: Saurabh Sarkar and Samuel H. Huang (School of Dynamic Systems, University of Cincinnati, Cincinnati, Ohio)
Abstract: This chapter presents a computationally efficient feature selection algorithm for binary classification problems on datasets with a very large number of features. The algorithm handles both continuous and discrete data. It uses a sequential forward search based on an analysis of the geometric location of data points from the two classes. Its main advantage is that it gives a more refined view of the geometric distribution of data points from the two classes of interest and accounts for interactions between features, while preserving the computational efficiency of simple methods such as univariate statistical analysis. The algorithm is applied to a DNA microarray dataset with over 44,000 features, from which two marker genes were identified. Classification models built on the selected gene (feature) set achieved higher 10-fold cross-validation accuracy than models built on the complete feature set. This observation is consistent with previous studies and further supports the importance of feature selection in deriving predictive transcriptional signatures of cancer. The algorithm was also tested successfully on a leukemia dataset with 7,129 features and 72 instances.
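The abstract does not spell out the geometric criterion used by the chapter's algorithm, so the following is only a minimal sketch of the general idea: discretize each feature, then run a greedy sequential forward search over the discretized features. Everything specific here is an assumption made for illustration and not the authors' method: the equal-width binning in `discretize`, the majority-class score in `cell_purity`, the stopping rule in `forward_select`, and the toy data `X` and `y`.

```python
from collections import Counter, defaultdict


def discretize(column, n_bins=3):
    """Equal-width binning of one feature column (assumed scheme)."""
    lo, hi = min(column), max(column)
    if hi == lo:
        # A constant feature carries no information: one bin for all points.
        return [0] * len(column)
    width = (hi - lo) / n_bins
    # Clamp so that the maximum value falls into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in column]


def cell_purity(binned_columns, labels):
    """Fraction of points that match the majority class of their grid cell.

    This is a stand-in score (an assumption), not the chapter's criterion.
    """
    cells = defaultdict(Counter)
    for i, label in enumerate(labels):
        key = tuple(col[i] for col in binned_columns)
        cells[key][label] += 1
    majority = sum(max(counts.values()) for counts in cells.values())
    return majority / len(labels)


def forward_select(X, y, max_features=2, n_bins=3):
    """Greedy sequential forward search over discretized features."""
    n_features = len(X[0])
    binned = [discretize([row[j] for row in X], n_bins)
              for j in range(n_features)]
    selected, best_score = [], 0.0
    while len(selected) < max_features:
        candidates = [j for j in range(n_features) if j not in selected]
        scored = [(cell_purity([binned[k] for k in selected + [j]], y), j)
                  for j in candidates]
        # Highest score wins; ties go to the lowest feature index.
        score, j = max(scored, key=lambda t: (t[0], -t[1]))
        if score <= best_score:
            break  # no candidate improves the score, so stop early
        selected.append(j)
        best_score = score
    return selected, best_score


# Hypothetical toy data: 6 samples, 3 features, two classes.
X = [
    [0.5, 1.0, 2.0],
    [0.1, 1.2, 2.0],
    [0.9, 0.9, 2.0],
    [0.4, 5.0, 2.0],
    [0.2, 5.1, 2.0],
    [0.8, 4.8, 2.0],
]
y = [0, 0, 0, 1, 1, 1]

selected, score = forward_select(X, y)
print(selected, score)
```

Because each step scores the joint grid cells of the already selected features plus one candidate, a candidate is judged in combination with earlier picks, not in isolation. The cost per step remains one pass over the data per candidate feature.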