randfeatures

Generate randomized subset of features

Syntax

[IDX, Z] = randfeatures(X, Group, 'PropertyName', PropertyValue...)
randfeatures(..., 'Classifier', C)
randfeatures(..., 'ClassOptions', CO)
randfeatures(..., 'PerformanceThreshold', PT)
randfeatures(..., 'ConfidenceThreshold', CT)
randfeatures(..., 'SubsetSize', SS)
randfeatures(..., 'PoolSize', PS)
randfeatures(..., 'NumberOfIndices', N)
randfeatures(..., 'CrossNorm', CN)
randfeatures(..., 'Verbose', VerboseValue)

Description

[IDX, Z] = randfeatures(X, Group, 'PropertyName', PropertyValue...) performs a randomized subset feature search reinforced by classification. randfeatures randomly generates subsets of features and uses each subset to classify the samples. Every subset is evaluated with the apparent error. Only the best subsets are kept, and they are joined into a single final pool. The number of times each feature appears in the pool measures its significance.
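The search described above can be illustrated with a small sketch. This is not the randfeatures source: it assumes a simple nearest-centroid rule in place of the real classifier, synthetic two-class data in which only features 1 to 5 carry signal, and hypothetical variable names throughout. It relies on implicit expansion (R2016b or later).

```matlab
% Sketch of a randomized subset search (not the randfeatures source).
% Assumption: a nearest-centroid rule stands in for the real classifier.
rng(0);                                  % reproducible random draws
numFeatures = 50;  numPerClass = 20;
labels = [ones(1,numPerClass), 2*ones(1,numPerClass)];
X = randn(numFeatures, 2*numPerClass);   % columns are observations
X(1:5, labels == 2) = X(1:5, labels == 2) + 3;   % features 1-5 are informative

subsetSize = 5;  numTrials = 2000;  perfThreshold = 0.8;
counts = zeros(numFeatures, 1);          % how often each feature is pooled

for t = 1:numTrials
    subset = randperm(numFeatures, subsetSize);  % random feature subset
    Xs = X(subset, :);
    c1 = mean(Xs(:, labels == 1), 2);    % class centroids on the subset
    c2 = mean(Xs(:, labels == 2), 2);
    d1 = sum((Xs - c1).^2, 1);           % squared distance to each centroid
    d2 = sum((Xs - c2).^2, 1);
    predicted = 1 + (d2 < d1);           % nearest centroid wins
    correctRate = mean(predicted == labels);     % 1 - apparent error
    if correctRate >= perfThreshold      % keep only well-performing subsets
        counts(subset) = counts(subset) + 1;
    end
end

[~, idx] = sort(counts, 'descend');      % most frequently pooled first
disp(idx(1:5)')
```

In this toy setup, subsets that contain an informative feature tend to pass the threshold, so those features accumulate the highest counts; counts plays a role analogous to Z and idx to IDX.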

X contains the training samples. Every column of X is an observed vector. Group contains the class labels. Group can be a numeric vector, a cell array of character vectors, or a string vector; numel(Group) must equal the number of columns in X, and numel(unique(Group)) must be greater than or equal to 2. Z is the classification significance for every feature. IDX contains the feature indices after sorting Z; that is, the first index points to the most significant feature.
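As a quick illustration of these input requirements, the following sketch checks a toy data set before it would be passed to randfeatures. The data and the names Xrows and Xcols are hypothetical; only the two conditions on Group come from the description above.

```matlab
% Hypothetical toy data: 4 features (rows) by 6 observations (columns).
X = rand(4, 6);
Group = {'A','A','A','B','B','B'};       % one label per column of X

% Checks that mirror the stated requirements on X and Group.
assert(numel(Group) == size(X, 2), ...
    'Group needs one label per column of X.');
assert(numel(unique(Group)) >= 2, ...
    'Group needs at least two distinct classes.');

% If the samples are stored as rows, transpose before calling randfeatures.
Xrows = X';                              % 6 observations by 4 features
Xcols = Xrows';                          % back to features by observations
```

Many other functions expect observations in rows, so a transpose like the one shown is a common step when moving data between them and randfeatures.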

randfeatures(..., 'Classifier', C) sets the classifier. Options are

'da'   (default)  Discriminant analysis
'knn'             K nearest neighbors

randfeatures(..., 'ClassOptions', CO) specifies a cell array with extra options for the selected classifier. When you specify the discriminant analysis model ('da') as the classifier, randfeatures uses the classify function with its default parameters. For the KNN classifier, randfeatures uses fitcknn with the following default options: {'Distance','correlation','NumNeighbors',5}.

randfeatures(..., 'PerformanceThreshold', PT) sets the correct classification threshold used to pick the subsets included in the final pool. For the 'da' model, the default is 0.8. For the 'knn' model, the default is 0.7.

randfeatures(..., 'ConfidenceThreshold', CT) uses the posterior probability of the discriminant analysis to invalidate classified subvectors with low confidence. When using the 'da' model, the default is 0.95.^(number of classes). When using the 'knn' model, the default is 1, meaning any classified subvector must have all k neighbors classified to the same class in order to be kept in the pool.
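To make the 'da' default concrete, the short calculation below evaluates 0.95.^(number of classes) for a few class counts. The class counts are hypothetical and chosen only for illustration.

```matlab
numClasses = [2 3 9];                    % hypothetical class counts
defaultCT = 0.95 .^ numClasses;          % documented 'da' default
fprintf('%d classes: %.4f\n', [numClasses; defaultCT]);
```

The default threshold drops as the number of classes grows: roughly 0.9025 for two classes, 0.8574 for three, and 0.6302 for nine.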

randfeatures(..., 'SubsetSize', SS) sets the number of features considered in every subset. Default is 20.

randfeatures(..., 'PoolSize', PS) sets the targeted number of accepted subsets for the final pool. Default is 1000.

randfeatures(..., 'NumberOfIndices', N) sets the number of output indices in IDX. Default is the same as the number of features.
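The relationship between Z, IDX, and N can be sketched with a plain sort. The scores below are hypothetical, and how randfeatures orders tied scores is not stated here, so the tie handling shown is just the behavior of sort.

```matlab
% Hypothetical significance scores for six features.
Z = [0.10 0.45 0.05 0.30 0.00 0.10];
N = 3;                                   % plays the role of 'NumberOfIndices'

[~, order] = sort(Z, 'descend');         % most significant feature first
IDX = order(1:N);
disp(IDX)
```

Here IDX is [2 4 1]: feature 2 has the largest score, then feature 4, and feature 1 precedes feature 6 because sort keeps tied elements in their original order.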

randfeatures(..., 'CrossNorm', CN) applies independent normalization across the observations for every feature. Cross-normalization ensures comparability among different features, although it is not always necessary because the selected classifier properties might already account for this. Options are

'none' (default)  Intensities are not cross-normalized.
'meanvar'         x_new = (x - mean(x))/std(x)  
'softmax'         x_new = (1+exp((mean(x)-x)/std(x)))^-1
'minmax'          x_new = (x - min(x))/(max(x)-min(x))
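The three formulas can be written out for a small matrix as a sketch. It assumes that features are rows and observations are columns, as for X, and that std uses its default N-1 normalization; the normalization randfeatures uses internally is not stated here. The variable names are hypothetical, and the code relies on implicit expansion (R2016b or later).

```matlab
X = [1 2 3 4; 10 20 30 50];              % hypothetical: 2 features, 4 observations

mu = mean(X, 2);                         % per-feature mean across observations
sd = std(X, 0, 2);                       % per-feature standard deviation
lo = min(X, [], 2);                      % per-feature minimum
hi = max(X, [], 2);                      % per-feature maximum

Xmeanvar = (X - mu) ./ sd;                   % 'meanvar'
Xsoftmax = 1 ./ (1 + exp((mu - X) ./ sd));   % 'softmax'
Xminmax  = (X - lo) ./ (hi - lo);            % 'minmax'
```

After 'meanvar', every feature has zero mean and unit standard deviation; 'softmax' maps values into the open interval (0, 1); 'minmax' maps each feature onto [0, 1].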

randfeatures(..., 'Verbose', VerboseValue), when VerboseValue is false, turns off verbosity. Default is true.
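Several of the options above can be combined in one call. The following is a sketch only: it assumes the NCI60tmatrix sample data (which supplies X and GROUP) is available, and the specific option values are arbitrary illustrations rather than recommended settings.

```matlab
load NCI60tmatrix                        % sample data providing X and GROUP

% KNN-based search with non-default options (values chosen for illustration).
[I, Z] = randfeatures(X, GROUP, ...
    'Classifier', 'knn', ...
    'ClassOptions', {'Distance','euclidean','NumNeighbors',3}, ...
    'PerformanceThreshold', 0.75, ...
    'PoolSize', 500, ...
    'Verbose', false);

topFeatures = I(1:10);                   % ten most significant features
```

Because the subsets are drawn at random, repeated runs can rank features differently; a larger PoolSize generally gives more stable counts at the cost of run time.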

Examples

Find a reduced set of genes that is sufficient for classification of all the cancer types in the t-matrix NCI60 data set.

Load sample data.

load NCI60tmatrix

Select features.

I = randfeatures(X,GROUP,'SubsetSize',15,'Classifier','da');

Test features with a linear discriminant classifier.

C = classify(X(I(1:25),:)',X(I(1:25),:)',GROUP);
cp = classperf(GROUP,C);
cp.CorrectRate

ans =

     1

References

[1] Li, L., Umbach, D.M., Terry, P., and Taylor, J.A. (2004). Application of the GA/KNN method to SELDI proteomics data. Bioinformatics. 20 (10), 1638-1640.

[2] Liu, H., and Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining. Kluwer Academic Publishers.

[3] Ross, D.T., et al. (2000). Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines. Nature Genetics. 24 (3), 227-235.

Version History

Introduced before R2006a