Neighborhood Component Analysis (NCA) Feature Selection

Neighborhood component analysis (NCA) is a non-parametric method for selecting features with the goal of maximizing prediction accuracy of regression and classification algorithms. The Statistics and Machine Learning Toolbox™ functions fscnca and fsrnca perform NCA feature selection with regularization to learn feature weights for minimization of an objective function that measures the average leave-one-out classification or regression loss over the training data.

NCA Feature Selection for Classification

Consider a multi-class classification problem with a training set containing n observations:

$S = \{(x_{i}, y_{i}),\ i = 1, 2, \dots, n\},$

where $x_{i} \in ℝ^{p}$ are the feature vectors, $y_{i} \in \{1, 2, \dots, c\}$ are the class labels, and $c$ is the number of classes. The aim is to learn a classifier $f : ℝ^{p} \to \{1, 2, \dots, c\}$ that accepts a feature vector and makes a prediction $f (x)$ for the true label $y$ of $x$.

Consider a randomized classifier that:

Randomly picks a point, $Ref (x)$, from $S$ as the ‘reference point’ for $x$.
Labels $x$ using the label of the reference point $Ref (x)$.

This scheme is similar to that of a 1-NN classifier, where the reference point is chosen to be the nearest neighbor of the new point $x$. In NCA, the reference point is chosen randomly and all points in $S$ have some probability of being selected as the reference point. The probability $P (Ref (x) = x_{j} | S)$ that point $x_{j}$ is picked from $S$ as the reference point for $x$ is higher if $x_{j}$ is closer to $x$ as measured by the distance function $d_{w}$, where

$d_{w} (x_{i}, x_{j}) = \sum_{r = 1}^{p} w_{r}^{2} | x_{i r} - x_{j r} |,$

and $w_{r}$ are the feature weights. Assume that

$P (Ref (x) = x_{j} | S) \propto k (d_{w} (x, x_{j})),$

where $k$ is some kernel or a similarity function that assumes large values when $d_{w} (x, x_{j})$ is small. Suppose it is

$k (z) = \exp (- \frac{z}{σ}),$

as suggested in [1]. The reference point for $x$ is chosen from $S$, so the sum of $P (Ref (x) = x_{j} | S)$ over all $j$ must equal 1. Therefore, it is possible to write

$P (Ref (x) = x_{j} | S) = \frac{k (d_{w} (x, x_{j}))}{\sum_{m = 1}^{n} k (d_{w} (x, x_{m}))} .$

Now consider the leave-one-out application of this randomized classifier, that is, predicting the label of $x_{i}$ using the data in $S^{- i}$, the training set $S$ excluding the point $(x_{i}, y_{i})$. The probability that point $x_{j}$ is picked as the reference point for $x_{i}$ is

$p_{i j} = P (Ref (x_{i}) = x_{j} | S^{- i}) = \frac{k (d_{w} (x_{i}, x_{j}))}{\sum_{m = 1, m \neq i}^{n} k (d_{w} (x_{i}, x_{m}))} .$
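The probabilities $p_{i j}$ can be computed directly from the distance and kernel definitions. The following is a minimal Python sketch, not the toolbox implementation; the function names `weighted_distance` and `loo_reference_probs` and the toy data are assumptions made for illustration.

```python
import math

def weighted_distance(xi, xj, w):
    """d_w(xi, xj) = sum over features r of w_r^2 * |xi_r - xj_r|."""
    return sum(wr * wr * abs(a - b) for wr, a, b in zip(w, xi, xj))

def loo_reference_probs(X, w, sigma=1.0):
    """Return P with P[i][j] = p_ij for j != i and P[i][i] = 0."""
    n = len(X)
    P = []
    for i in range(n):
        # Kernel k(z) = exp(-z / sigma); the point itself is excluded (S^{-i}).
        k = [0.0 if j == i
             else math.exp(-weighted_distance(X[i], X[j], w) / sigma)
             for j in range(n)]
        total = sum(k)  # normalizing constant over all points other than i
        P.append([kj / total for kj in k])
    return P

# Toy data (made up): three points with two features each.
X = [[0.0, 5.0], [1.0, -3.0], [4.0, 0.0]]
w = [1.0, 0.0]  # a zero weight removes the second feature from the distance
P = loo_reference_probs(X, w)
print([round(p, 4) for p in P[0]])
```

Each row of `P` sums to 1, and the nearer point under $d_{w}$ receives the larger probability. For large distances the kernel values can underflow to zero, so a production implementation would need a numerically safer normalization.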

The leave-one-out probability of correct classification for observation $i$ is the probability $p_{i}$ that the randomized classifier correctly classifies observation $i$ using $S^{- i}$:

$p_{i} = \sum_{j = 1, j \neq i}^{n} P (Ref (x_{i}) = x_{j} | S^{- i}) I (y_{i} = y_{j}) = \sum_{j = 1, j \neq i}^{n} p_{i j} y_{i j},$

where

$y_{i j} = I (y_{i} = y_{j}) = \begin{cases} 1 & \text{if } y_{i} = y_{j}, \\ 0 & \text{otherwise.} \end{cases}$

The average leave-one-out probability of correct classification using the randomized classifier can be written as

$F (w) = \frac{1}{n} \sum_{i = 1}^{n} p_{i} .$
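To make $p_{i}$ and $F (w)$ concrete, the sketch below evaluates both on a small made-up data set. It is an illustrative Python sketch only; `loo_correct_probs` and `average_loo_accuracy` are hypothetical names, and the kernel parameter defaults to $σ = 1$.

```python
import math

def loo_correct_probs(X, y, w, sigma=1.0):
    """Return [p_1, ..., p_n]: leave-one-out probabilities of correct classification."""
    n = len(X)
    p = []
    for i in range(n):
        same, total = 0.0, 0.0
        for j in range(n):
            if j == i:
                continue  # leave observation i out
            d = sum(wr * wr * abs(a - b) for wr, a, b in zip(w, X[i], X[j]))
            k = math.exp(-d / sigma)
            total += k
            if y[j] == y[i]:  # y_ij = 1 only for same-class reference points
                same += k
        p.append(same / total)
    return p

def average_loo_accuracy(X, y, w, sigma=1.0):
    """F(w) = (1/n) * sum_i p_i, without regularization."""
    p = loo_correct_probs(X, y, w, sigma)
    return sum(p) / len(p)

# Toy data (made up): one feature, two well-separated classes.
X = [[0.0], [1.0], [10.0], [11.0]]
y = [0, 0, 1, 1]
F_informative = average_loo_accuracy(X, y, [1.0])
F_ignored = average_loo_accuracy(X, y, [0.0])  # every other point is equally likely
print(F_informative, F_ignored)
```

With the weight set to zero, each of the three remaining points is equally likely to be the reference point and only one shares the label, so $F (w) = 1/3$. A nonzero weight on the informative feature pushes $F (w)$ close to 1.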

$F (w)$ depends on the weight vector $w$ through the probabilities $p_{i j}$. The goal of neighborhood component analysis is to maximize $F (w)$ with respect to $w$. fscnca uses the regularized objective function introduced in [1]:

$\begin{aligned} F (w) &= \frac{1}{n} \sum_{i = 1}^{n} p_{i} - λ \sum_{r = 1}^{p} w_{r}^{2} \\ &= \frac{1}{n} \sum_{i = 1}^{n} \underbrace{\left[ \sum_{j = 1, j \neq i}^{n} p_{i j} y_{i j} - λ \sum_{r = 1}^{p} w_{r}^{2} \right]}_{F_{i} (w)} \\ &= \frac{1}{n} \sum_{i = 1}^{n} F_{i} (w), \end{aligned}$

where $λ$ is the regularization parameter. The regularization term drives many of the weights in $w$ to 0.

After choosing the kernel parameter $σ$ in $p_{i j}$ as 1, finding the weight vector $w$ can be expressed as the following minimization problem for a given $λ$:

$\hat{w} = \underset{w}{\operatorname{argmin}}\, f (w) = \underset{w}{\operatorname{argmin}}\, \frac{1}{n} \sum_{i = 1}^{n} f_{i} (w),$

where $f (w) = - F (w)$ and $f_{i} (w) = - F_{i} (w)$.

Note that

$\frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} p_{i j} = 1,$

and the argument of the minimum does not change if you add a constant to an objective function. Therefore, you can rewrite the objective function by adding the constant 1.

$\begin{aligned} \hat{w} &= \underset{w}{\operatorname{argmin}} \left\{ 1 + f (w) \right\} \\ &= \underset{w}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} p_{i j} - \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} p_{i j} y_{i j} + λ \sum_{r = 1}^{p} w_{r}^{2} \right\} \\ &= \underset{w}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} p_{i j} (1 - y_{i j}) + λ \sum_{r = 1}^{p} w_{r}^{2} \right\} \\ &= \underset{w}{\operatorname{argmin}} \left\{ \frac{1}{n} \sum_{i = 1}^{n} \sum_{j = 1, j \neq i}^{n} p_{i j}\, l (y_{i}, y_{j}) + λ \sum_{r = 1}^{p} w_{r}^{2} \right\}, \end{aligned}$

where the loss function is defined as

$l (y_{i}, y_{j}) = \begin{cases} 1 & \text{if } y_{i} \neq y_{j}, \\ 0 & \text{otherwise.} \end{cases}$

The argument of the minimum is the weight vector that minimizes the classification error. You can specify a custom loss function using the LossFunction name-value pair argument in the call to fscnca.
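The loss form of the objective can be evaluated with a few lines of code. The Python sketch below is illustrative and is not how fscnca is implemented; `nca_objective`, `zero_one_loss`, and the `loss` parameter are hypothetical names, and the data are made up. It shows the trade-off between the misclassification term and the penalty $λ \sum_{r} w_{r}^{2}$.

```python
import math

def zero_one_loss(yi, yj):
    """l(y_i, y_j) = 1 if the labels differ, 0 otherwise."""
    return 0.0 if yi == yj else 1.0

def nca_objective(X, y, w, lam, loss=zero_one_loss, sigma=1.0):
    """(1/n) * sum_i sum_{j != i} p_ij * l(y_i, y_j) + lam * sum_r w_r^2."""
    n = len(X)
    total_loss = 0.0
    for i in range(n):
        # Unnormalized kernel values; index i is excluded (leave-one-out).
        k = [0.0 if j == i else
             math.exp(-sum(wr * wr * abs(a - b)
                           for wr, a, b in zip(w, X[i], X[j])) / sigma)
             for j in range(n)]
        norm = sum(k)
        total_loss += sum(kj / norm * loss(y[i], y[j])
                          for j, kj in enumerate(k) if j != i)
    penalty = lam * sum(wr * wr for wr in w)
    return total_loss / n + penalty

# Toy data (made up): one informative feature, two classes.
X = [[0.0], [1.0], [10.0], [11.0]]
y = [0, 0, 1, 1]
lam = 0.1
f_zero = nca_objective(X, y, [0.0], lam)  # loss 2/3, no penalty
f_one = nca_objective(X, y, [1.0], lam)   # loss near 0, penalty 0.1
print(f_zero, f_one)
```

Here the unit weight gives the lower objective, because the drop in expected misclassification outweighs the penalty. For a feature that carries no class information the loss term would barely change, so the penalty would favor a weight of zero.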

NCA Feature Selection for Regression

The fsrnca function performs NCA feature selection modified for regression. Given n observations

$S = \{(x_{i}, y_{i}),\ i = 1, 2, \dots, n\},$

the only difference from the classification problem is that the response values $y_{i} \in ℝ$ are continuous. In this case, the aim is to predict the response $y$ given the training set $S$.

Consider a randomized regression model that:

Randomly picks a point, $Ref (x)$, from $S$ as the ‘reference point’ for $x$.
Sets the response value at $x$ equal to the response value of the reference point $Ref (x)$.

Again, the probability $P (Ref (x) = x_{j} | S)$ that point $x_{j}$ is picked from $S$ as the reference point for $x$ is

$P (Ref (x) = x_{j} | S) = \frac{k (d_{w} (x, x_{j}))}{\sum_{m = 1}^{n} k (d_{w} (x, x_{m}))} .$

Now consider the leave-one-out application of this randomized regression model, that is, predicting the response for $x_{i}$ using the data in $S^{- i}$, the training set $S$ excluding the point $(x_{i}, y_{i})$. The probability that point $x_{j}$ is picked as the reference point for $x_{i}$ is

$p_{i j} = P (Ref (x_{i}) = x_{j} | S^{- i}) = \frac{k (d_{w} (x_{i}, x_{j}))}{\sum_{m = 1, m \neq i}^{n} k (d_{w} (x_{i}, x_{m}))} .$

Let ${\hat{y}}_{i}$ be the response value the randomized regression model predicts and $y_{i}$ be the actual response for $x_{i}$, and let $l : ℝ^{2} \to ℝ$ be a loss function that measures the disagreement between ${\hat{y}}_{i}$ and $y_{i}$. Then the expected value of $l (y_{i}, {\hat{y}}_{i})$ is

$l_{i} = E (l (y_{i}, {\hat{y}}_{i}) | S^{- i}) = \sum_{j = 1, j \neq i}^{n} p_{i j} l (y_{i}, y_{j}) .$

After adding the regularization term, the objective function for minimization is:

$f (w) = \frac{1}{n} \sum_{i = 1}^{n} l_{i} + λ \sum_{r = 1}^{p} w_{r}^{2} .$

The default loss function for NCA for regression is mean absolute deviation, $l (y_{i}, y_{j}) = | y_{i} - y_{j} |$, but you can specify other loss functions, including a custom one, using the LossFunction name-value pair argument in the call to fsrnca.
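The regression objective has the same structure as the classification objective, with an absolute-difference loss in place of the 0-1 loss. The Python sketch below evaluates it on made-up data; it is not the fsrnca implementation, and `nca_regression_objective`, `mad_loss`, and the `loss` parameter are hypothetical names.

```python
import math

def mad_loss(yi, yj):
    """Absolute deviation between two responses."""
    return abs(yi - yj)

def nca_regression_objective(X, y, w, lam, loss=mad_loss, sigma=1.0):
    """f(w) = (1/n) * sum_i l_i + lam * sum_r w_r^2."""
    n = len(X)
    total = 0.0
    for i in range(n):
        # Kernel values for every candidate reference point except i itself.
        k = [0.0 if j == i else
             math.exp(-sum(wr * wr * abs(a - b)
                           for wr, a, b in zip(w, X[i], X[j])) / sigma)
             for j in range(n)]
        norm = sum(k)
        # l_i: expected loss when the response of a random reference point is used.
        total += sum(kj / norm * loss(y[i], y[j])
                     for j, kj in enumerate(k) if j != i)
    return total / n + lam * sum(wr * wr for wr in w)

# Toy data (made up): the response equals the single feature.
X = [[0.0], [1.0], [2.0]]
y = [0.0, 1.0, 2.0]
f_ignored = nca_regression_objective(X, y, [0.0], lam=0.0)
f_used = nca_regression_objective(X, y, [2.0], lam=0.0)
f_penalized = nca_regression_objective(X, y, [2.0], lam=0.5)
print(f_ignored, f_used, f_penalized)
```

With a zero weight every other point is an equally likely reference, giving an average loss of 4/3. Weighting the feature concentrates probability on nearby points and lowers the loss term, while the regularization term adds $λ w^{2} = 2$ when $λ = 0.5$ and $w = 2$.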

Impact of Standardization

The regularization term drives the weights of irrelevant predictors to zero. In the objective functions for NCA for classification or regression, there is only one regularization parameter $λ$ for all weights. Because a single $λ$ penalizes every weight equally, the magnitudes of the weights must be comparable to each other. When the features in the vectors $x_{i}$ in $S$ are on different scales, the learned weights can also end up on different scales and are then not meaningful. To avoid this situation, standardize the predictors to have zero mean and unit standard deviation before applying NCA. You can standardize the predictors using the 'Standardize',true name-value pair argument in the call to fscnca or fsrnca.
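The effect of scale on the distance $d_{w}$ can be seen in a small example. The Python sketch below standardizes each column by hand; the `standardize` helper is hypothetical, the data are made up, and the use of the sample standard deviation (dividing by n - 1) is an assumption rather than a documented detail of the 'Standardize' option.

```python
from statistics import mean, stdev

def standardize(X):
    """Center each column to zero mean and scale it to unit sample standard deviation."""
    cols = list(zip(*X))
    mus = [mean(c) for c in cols]
    sds = [stdev(c) for c in cols]
    # A constant column has zero spread; it is mapped to 0 to avoid dividing by zero.
    return [[(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, mus, sds)]
            for row in X]

def weighted_distance(xi, xj, w):
    """d_w(xi, xj) = sum over features r of w_r^2 * |xi_r - xj_r|."""
    return sum(wr * wr * abs(a - b) for wr, a, b in zip(w, xi, xj))

# Toy data (made up): column 1 in metres, column 2 in millimetres.
X = [[1.0, 2000.0], [2.0, 1000.0], [3.0, 3000.0]]
w = [1.0, 1.0]

raw = weighted_distance(X[0], X[1], w)     # 1 + 1000: dominated by column 2
Z = standardize(X)
scaled = weighted_distance(Z[0], Z[1], w)  # 1 + 1: both columns contribute equally
print(raw, scaled)
```

Before standardization, equal weights let the millimetre column dominate the distance, so the optimizer would have to shrink that weight just to compensate for units. After standardization, equal weights mean equal influence, which is what a single $λ$ implicitly assumes.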

Choosing the Regularization Parameter Value

It is usually necessary to select a value of the regularization parameter by calculating the accuracy of the randomized NCA classifier or regression model on an independent test set. If you use cross-validation instead of a single test set, select the $λ$ value that minimizes the average loss across the cross-validation folds. For examples, see Tune Regularization Parameter to Detect Features Using NCA for Classification and Tune Regularization Parameter in NCA for Regression.
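The selection procedure can be sketched as a loop over candidate $λ$ values. The Python example below is a rough illustration under strong assumptions: the exhaustive search over a small weight grid in `fit_weights` is a crude stand-in for a real optimizer, all function names are hypothetical, and the training and validation data are made up.

```python
import math
from itertools import product

def ref_probs(x, X_ref, w):
    """Probability of each row of X_ref being the reference point for x (sigma = 1)."""
    k = [math.exp(-sum(wr * wr * abs(a - b) for wr, a, b in zip(w, x, xr)))
         for xr in X_ref]
    total = sum(k)
    return [kj / total for kj in k]

def train_objective(X, y, w, lam):
    """Regularized leave-one-out misclassification objective f(w)."""
    n = len(X)
    loss = 0.0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        p = ref_probs(X[i], [X[j] for j in others], w)
        loss += sum(pj for pj, j in zip(p, others) if y[j] != y[i])
    return loss / n + lam * sum(wr * wr for wr in w)

def fit_weights(X, y, lam, grid=(0.0, 0.5, 1.0, 2.0)):
    """Stand-in for an optimizer: try every weight combination on a small grid."""
    return min(product(grid, repeat=len(X[0])),
               key=lambda w: train_objective(X, y, w, lam))

def validation_loss(X_tr, y_tr, X_val, y_val, w):
    """Expected misclassification rate of the randomized classifier on held-out data."""
    total = 0.0
    for x, label in zip(X_val, y_val):
        p = ref_probs(x, X_tr, w)
        total += sum(pj for pj, yj in zip(p, y_tr) if yj != label)
    return total / len(X_val)

# Toy data (made up): feature 0 separates the classes, feature 1 is noise.
X_tr = [[0.0, 3.0], [1.0, -2.0], [0.5, 0.0], [5.0, 2.5], [6.0, -3.0], [5.5, 0.5]]
y_tr = [0, 0, 0, 1, 1, 1]
X_val = [[0.8, -1.0], [5.2, 1.0]]
y_val = [0, 1]

lambdas = [0.0, 0.01, 0.1, 1.0]
results = {}
for lam in lambdas:
    w = fit_weights(X_tr, y_tr, lam)
    results[lam] = (validation_loss(X_tr, y_tr, X_val, y_val, w), w)

best_lambda = min(results, key=lambda lam: results[lam][0])
print(best_lambda, results[best_lambda])
```

With cross-validation, the same loop would run once per fold and the per-fold validation losses would be averaged before taking the minimum over $λ$.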

References

[1] Yang, W., K. Wang, and W. Zuo. "Neighborhood Component Feature Selection for High-Dimensional Data." Journal of Computers. Vol. 7, Number 1, January 2012.