randfeatures
Generate randomized subset of features
Syntax
[IDX, Z] = randfeatures(X, Group, 'PropertyName', PropertyValue...)
randfeatures(..., 'Classifier', C)
randfeatures(..., 'ClassOptions', CO)
randfeatures(..., 'PerformanceThreshold', PT)
randfeatures(..., 'ConfidenceThreshold', CT)
randfeatures(..., 'SubsetSize', SS)
randfeatures(..., 'PoolSize', PS)
randfeatures(..., 'NumberOfIndices', N)
randfeatures(..., 'CrossNorm', CN)
randfeatures(..., 'Verbose', VerboseValue)
Description
[IDX, Z] = randfeatures(X, Group, 'PropertyName', PropertyValue...) performs a randomized subset feature search reinforced by classification. randfeatures randomly generates subsets of features and uses each subset to classify the samples. Every subset is evaluated with the apparent error. Only the best subsets are kept, and they are joined into a single final pool. The number of times a feature appears in the pool (its cardinality) measures its significance.
X contains the training samples. Every column of X is an observed vector. Group contains the class labels. Group can be a numeric vector, a cell array of character vectors, or a string vector; numel(Group) must be the same as the number of columns in X, and numel(unique(Group)) must be greater than or equal to 2. Z is the classification significance for every feature. IDX contains the indices after sorting Z; that is, the first index points to the most significant feature.
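The counting and sorting steps described above can be sketched in base MATLAB. This is a minimal illustration, not the toolbox implementation: the names pool, numFeatures, z, and idx are hypothetical, and the sketch assumes the significance is a raw count of appearances, which the description does not specify.

```matlab
% Hypothetical pool of accepted subsets: each row lists the
% feature indices of one subset that passed the thresholds.
numFeatures = 6;
pool = [1 3 5;
        3 5 6;
        2 3 5];

% Count how often each feature appears in the pool (its cardinality).
z = accumarray(pool(:), 1, [numFeatures 1]);

% Rank features from most to least frequent, mirroring how IDX relates to Z.
[~, idx] = sort(z, 'descend');
```

In this toy pool, features 3 and 5 appear in every subset, so they lead the ranking, while feature 4 never appears and ranks last.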
randfeatures(..., 'Classifier', C) sets the classifier. Options are
'da' (default): Discriminant analysis
'knn': K nearest neighbors
randfeatures(..., 'ClassOptions', CO) specifies a cell array of extra options for the selected classifier. When you specify the discriminant analysis model ('da') as a classifier, randfeatures uses the classify function with its default parameters. For the KNN classifier, randfeatures uses fitcknn with the following default options: {'Distance','correlation','NumNeighbors',5}.
randfeatures(..., 'PerformanceThreshold', PT) sets the correct classification threshold used to pick the subsets included in the final pool. For the 'da' model, the default is 0.8. For the 'knn' model, the default is 0.7.
randfeatures(..., 'ConfidenceThreshold', CT) uses the posterior probability of the discriminant analysis to invalidate classified subvectors with low confidence. When using the 'da' model, the default is 0.95.^(number of classes). When using the 'knn' model, the default is 1, meaning any classified subvector must have all k neighbors classified to the same class in order to be kept in the pool.
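As a quick check of the 'da' default, the expression 0.95.^(number of classes) can be evaluated for a few class counts. The class counts below are hypothetical values chosen only for illustration.

```matlab
numClasses = [2 3 9];             % hypothetical class counts
defaultCT = 0.95 .^ numClasses;   % default 'da' confidence threshold for each count
disp(defaultCT)                   % the threshold decreases as the number of classes grows
```

For two classes the default works out to 0.9025, so the default confidence requirement loosens as more classes are involved.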
randfeatures(..., 'SubsetSize', SS) sets the number of features considered in every subset. Default is 20.
randfeatures(..., 'PoolSize', PS) sets the targeted number of accepted subsets for the final pool. Default is 1000.
randfeatures(..., 'NumberOfIndices', N) sets the number of output indices in IDX. Default is the same as the number of features.
randfeatures(..., 'CrossNorm', CN) applies independent normalization across the observations for every feature. Cross-normalization ensures comparability among different features, although it is not always necessary because the selected classifier properties might already account for this. Options are
'none' (default): Intensities are not cross-normalized.
'meanvar': x_new = (x - mean(x))/std(x)
'softmax': x_new = (1 + exp((mean(x) - x)/std(x)))^-1
'minmax': x_new = (x - min(x))/(max(x) - min(x))
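The three normalization formulas can be written out in base MATLAB as follows. This is a sketch under stated assumptions rather than the function's internal code: the data matrix and the output names Xmeanvar, Xsoftmax, and Xminmax are hypothetical, features are assumed to lie in rows (as in X), and the code relies on implicit expansion (R2016b or later).

```matlab
% Hypothetical data: 2 features (rows) by 4 samples (columns).
X = [ 1  2  3  4;
     10 20 30 40];

mu    = mean(X, 2);     % per-feature mean across samples
sigma = std(X, 0, 2);   % per-feature standard deviation across samples

% 'meanvar': center and scale each feature.
Xmeanvar = (X - mu) ./ sigma;

% 'softmax': logistic squashing of the standardized values into (0, 1).
Xsoftmax = 1 ./ (1 + exp((mu - X) ./ sigma));

% 'minmax': rescale each feature to the range [0, 1].
Xminmax = (X - min(X, [], 2)) ./ (max(X, [], 2) - min(X, [], 2));
```

Note that the softmax formula needs element-wise division (./) when applied to a whole matrix; the scalar notation ^-1 in the option list would otherwise be read as a matrix inverse.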
randfeatures(..., 'Verbose', VerboseValue), when VerboseValue is true, turns on verbosity. Default is true.
Examples
Find a reduced set of genes that is sufficient for classification of all the cancer types in the t-matrix NCI60 data set.
Load sample data.
load NCI60tmatrix
Select features.
I = randfeatures(X,GROUP,'SubsetSize',15,'Classifier','da');
Test features with a linear discriminant classifier.
C = classify(X(I(1:25),:)',X(I(1:25),:)',GROUP);
cp = classperf(GROUP,C);
cp.CorrectRate
ans = 1
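Several of the properties described above can be combined in a single call. The following sketch uses the same NCI60tmatrix sample data with the KNN classifier. The option values are illustrative assumptions, not recommendations, and whether ClassOptions replaces or extends the default KNN options is not stated in the description. Because the search is randomized, the selected indices can differ between runs.

```matlab
load NCI60tmatrix   % provides X (features in rows, samples in columns) and GROUP

% Select features with the KNN classifier; all option values are illustrative.
[idx, z] = randfeatures(X, GROUP, ...
    'Classifier', 'knn', ...
    'ClassOptions', {'Distance', 'euclidean', 'NumNeighbors', 3}, ...
    'SubsetSize', 15, ...
    'NumberOfIndices', 25, ...
    'Verbose', false);

% idx holds the requested number of indices; z holds one value per feature.
numel(idx)
numel(z)
```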
References
[1] Li, L., Umbach, D.M., Terry, P., and Taylor, J.A. (2004). Application of the GA/KNN method to SELDI proteomics data. Bioinformatics. 20 (10), 1638-1640.
[2] Liu, H., Motoda, H. (1998). Feature Selection for Knowledge Discovery and Data Mining, Kluwer Academic Publishers.
[3] Ross, D.T. et al. (2000). Systematic Variation in Gene Expression Patterns in Human Cancer Cell Lines. Nature Genetics. 24 (3), 227-235.
Version History
Introduced before R2006a
See Also
classperf | crossvalind | rankfeatures | classify | sequentialfs