Linear classification learner template
templateLinear creates a template suitable for fitting a linear classification model to high-dimensional data for multiclass problems.
The template specifies the binary learner model, regularization type and strength, and solver, among other things. After creating the template, train the model by passing the template and data to fitcecoc.
t = templateLinear() returns a linear classification learner template.
If you specify a default template, then the software uses default values for all input arguments during training.
t = templateLinear(Name,Value) returns a template with additional options specified by one or more name-value pair arguments. For example, you can specify to implement logistic regression, specify the regularization type or strength, or specify the solver to use for objective-function minimization.
If you display t in the Command Window, then all options appear empty ([]) except options that you specify using name-value pair arguments. During training, the software uses default values for empty options.
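As a rough illustration of the two calling forms, the following sketch builds a default template and a customized one. It assumes Statistics and Machine Learning Toolbox is installed. The variable names tDefault and tLogistic and the value 1e-4 are illustrative choices, not taken from this page.

```matlab
% Default template: every option is left empty and filled in at training time.
tDefault = templateLinear();

% Customized template: logistic-regression learners with a lasso penalty.
% 'sparsa' is one of the solvers this page lists as compatible with 'lasso'.
tLogistic = templateLinear('Learner','logistic', ...
    'Regularization','lasso', ...
    'Lambda',1e-4, ...            % illustrative regularization strength
    'Solver','sparsa');

% No semicolon, so MATLAB displays the template in the Command Window.
tLogistic
```

Either template can then be passed to fitcecoc through the 'Learners' argument.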
Train an ECOC model composed of multiple binary, linear classification models.
Load the NLP data set.
load nlpdata
X is a sparse matrix of predictor data, and Y is a categorical vector of class labels. There are more than two classes in the data.
Create a default linear-classification-model template.
t = templateLinear();
To adjust the default values, see the Name-Value Pair Arguments on the templateLinear page.
Train an ECOC model composed of multiple binary, linear classification models that can identify the product given the frequency distribution of words on a documentation web page. For faster training time, transpose the predictor data, and specify that observations correspond to columns.
X = X';
rng(1); % For reproducibility
Mdl = fitcecoc(X,Y,'Learners',t,'ObservationsIn','columns')
Mdl = 
  CompactClassificationECOC
      ResponseName: 'Y'
        ClassNames: [comm dsp ecoder fixedpoint ... ]
    ScoreTransform: 'none'
    BinaryLearners: {78x1 cell}
      CodingMatrix: [13x78 double]
  Properties, Methods
Alternatively, you can train an ECOC model composed of default linear classification models using 'Learners','Linear'.
To conserve memory, fitcecoc returns trained ECOC models composed of linear classification learners in CompactClassificationECOC model objects.
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside quotes. You can specify several name and value pair arguments in any order as Name1,Value1,...,NameN,ValueN.
Example: 'Learner','logistic','Regularization','lasso','CrossVal','on' specifies to implement logistic regression with a lasso penalty, and to implement 10-fold cross-validation.
Lambda — Regularization term strength
'auto' (default) | nonnegative scalar | vector of nonnegative values
Regularization term strength, specified as the comma-separated pair consisting of 'Lambda' and 'auto', a nonnegative scalar, or a vector of nonnegative values.
For 'auto', Lambda = 1/n.
If you specify a cross-validation name-value pair argument (e.g., CrossVal), then n is the number of in-fold observations.
Otherwise, n is the training sample size.
For a vector of nonnegative values, templateLinear sequentially optimizes the objective function for each distinct value in Lambda in ascending order.
If Solver is 'sgd' or 'asgd' and Regularization is 'lasso', templateLinear does not use the previous coefficient estimates as a warm start for the next optimization iteration. Otherwise, templateLinear uses warm starts.
If Regularization is 'lasso', then any coefficient estimate of 0 retains its value when templateLinear optimizes using subsequent values in Lambda.
templateLinear returns coefficient estimates for each specified regularization strength.
Example: 'Lambda',10.^(-(10:-2:2))
Data Types: char | string | double | single
Learner — Linear classification model type
'svm' (default) | 'logistic'
Linear classification model type, specified as the comma-separated pair consisting of 'Learner' and 'svm' or 'logistic'.
In this table, $$f\left(x\right)=x\beta +b.$$
β is a vector of p coefficients.
x is an observation from p predictor variables.
b is the scalar bias.
Value | Algorithm | Response Range | Loss Function
'svm' | Support vector machine | y ∊ {-1,1}; 1 for the positive class and -1 otherwise | Hinge: $$\ell \left[y,f\left(x\right)\right]=\mathrm{max}\left[0,1-yf\left(x\right)\right]$$
'logistic' | Logistic regression | Same as 'svm' | Deviance (logistic): $$\ell \left[y,f\left(x\right)\right]=\mathrm{log}\left\{1+\mathrm{exp}\left[-yf\left(x\right)\right]\right\}$$
Example: 'Learner','logistic'
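To make the two loss functions concrete, here is a minimal sketch that evaluates the standard hinge loss and logistic deviance on a single linear score. It uses only base MATLAB. The helper names (score, hingeLoss, logisticLoss) and all numeric values are hypothetical and chosen for illustration.

```matlab
% Linear score f(x) = x*beta + b for one observation (row vector x).
score = @(x, beta, b) x*beta + b;

% Hinge loss for 'svm' and logistic deviance for 'logistic'.
% y must be +1 (positive class) or -1 (otherwise).
hingeLoss    = @(y, f) max(0, 1 - y.*f);
logisticLoss = @(y, f) log(1 + exp(-y.*f));

x    = [1 2];          % illustrative observation with p = 2 predictors
beta = [0.5; 0.25];    % illustrative coefficients
b    = 1;              % illustrative bias
f    = score(x, beta, b);          % 0.5 + 0.5 + 1 = 2

lossSvmPos = hingeLoss(1, f);      % score is beyond the margin, so loss is 0
lossSvmNeg = hingeLoss(-1, f);     % 1 - (-1)(2) = 3
lossLogPos = logisticLoss(1, f);   % log(1 + exp(-2))
```

Both losses shrink as y*f(x) grows, but the hinge loss is exactly zero once y*f(x) reaches 1, whereas the logistic deviance only approaches zero.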
Regularization — Complexity penalty type
'lasso' | 'ridge'
Complexity penalty type, specified as the comma-separated pair consisting of 'Regularization' and 'lasso' or 'ridge'.
The software composes the objective function for minimization from the sum of the average loss function (see Learner) and the regularization term in this table.
Value | Description
'lasso' | Lasso (L1) penalty: $$\lambda {\displaystyle \sum _{j=1}^{p}\left|{\beta}_{j}\right|}$$
'ridge' | Ridge (L2) penalty: $$\frac{\lambda}{2}{\displaystyle \sum _{j=1}^{p}{\beta}_{j}^{2}}$$
To specify the regularization term strength, which is λ in the expressions, use Lambda.
The software excludes the bias term (β_0) from the regularization penalty.
If Solver is 'sparsa', then the default value of Regularization is 'lasso'. Otherwise, the default is 'ridge'.
Tip
For predictor variable selection, specify 'lasso'. For more on variable selection, see Introduction to Feature Selection.
For optimization accuracy, specify 'ridge'.
Example: 'Regularization','lasso'
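The objective described here (average loss plus penalty) can be sketched in a few lines of base MATLAB. The data, coefficients, and lambda below are made-up illustrative values, and the hinge loss stands in for whichever loss Learner selects. This is a sketch of the formula, not the software's internal implementation.

```matlab
X      = [1 2; -1 0.5; 0.3 -2];   % illustrative data: rows are observations
y      = [1; -1; 1];              % labels coded as +1 / -1
beta   = [0.4; -0.1];             % illustrative coefficients
b      = 0.2;                     % illustrative bias (not penalized)
lambda = 0.1;                     % illustrative regularization strength

f        = X*beta + b;                  % linear scores
avgHinge = mean(max(0, 1 - y.*f));      % average hinge loss

lassoPenalty = lambda*sum(abs(beta));       % L1 penalty
ridgePenalty = (lambda/2)*sum(beta.^2);     % L2 penalty

objLasso = avgHinge + lassoPenalty;
objRidge = avgHinge + ridgePenalty;
```

Note that b appears in the scores but in neither penalty, matching the statement that the bias term is excluded from regularization.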
Solver — Objective function minimization technique
'sgd' | 'asgd' | 'dual' | 'bfgs' | 'lbfgs' | 'sparsa' | string array | cell array of character vectors
Objective function minimization technique, specified as the comma-separated pair consisting of 'Solver' and a character vector or string scalar, a string array, or a cell array of character vectors with values from this table.
Value | Description | Restrictions
'sgd' | Stochastic gradient descent (SGD) [4][2] |
'asgd' | Average stochastic gradient descent (ASGD) [7] |
'dual' | Dual SGD for SVM [1][6] | Regularization must be 'ridge' and Learner must be 'svm'.
'bfgs' | Broyden-Fletcher-Goldfarb-Shanno quasi-Newton algorithm (BFGS) [3] | Inefficient if X is very high-dimensional.
'lbfgs' | Limited-memory BFGS (LBFGS) [3] | Regularization must be 'ridge'.
'sparsa' | Sparse Reconstruction by Separable Approximation (SpaRSA) [5] | Regularization must be 'lasso'.
If you specify:
A ridge penalty (see Regularization) and the predictor data set contains 100 or fewer predictor variables, then the default solver is 'bfgs'.
An SVM model (see Learner), a ridge penalty, and the predictor data set contains more than 100 predictor variables, then the default solver is 'dual'.
A lasso penalty and the predictor data set contains 100 or fewer predictor variables, then the default solver is 'sparsa'.
Otherwise, the default solver is 'sgd'.
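The default-solver rules above can be read as a simple decision chain. The sketch below mirrors them in base MATLAB. It is a paraphrase of the documented rules, not the software's code, and it assumes the learner, penalty type, and number of predictors p are already known (all three values here are illustrative).

```matlab
learner        = 'svm';     % illustrative: 'svm' or 'logistic'
regularization = 'ridge';   % illustrative: 'ridge' or 'lasso'
p              = 5000;      % illustrative number of predictor variables

if strcmp(regularization,'ridge') && p <= 100
    solver = 'bfgs';        % ridge, few predictors
elseif strcmp(learner,'svm') && strcmp(regularization,'ridge') && p > 100
    solver = 'dual';        % SVM with ridge, many predictors
elseif strcmp(regularization,'lasso') && p <= 100
    solver = 'sparsa';      % lasso, few predictors
else
    solver = 'sgd';         % everything else
end
```

With these inputs the chain selects 'dual'. Switching learner to 'logistic' with the same p falls through to 'sgd'.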
If you specify a string array or cell array of solver names, then, for each value in Lambda, the software uses the solutions of solver j as a warm start for solver j + 1.
Example: {'sgd' 'lbfgs'} applies SGD to solve the objective, and uses the solution as a warm start for LBFGS.
Tip
SGD and ASGD can solve the objective function more quickly than other solvers, whereas LBFGS and SpaRSA can yield more accurate solutions than other solvers. Solver combinations like {'sgd' 'lbfgs'} and {'sgd' 'sparsa'} can balance optimization speed and accuracy.
When choosing between SGD and ASGD, consider that:
SGD takes less time per iteration, but requires more iterations to converge.
ASGD requires fewer iterations to converge, but takes more time per iteration.
If the predictor data is high-dimensional and Regularization is 'ridge', set Solver to any of these combinations:
'sgd'
'asgd'
'dual' if Learner is 'svm'
'lbfgs'
{'sgd','lbfgs'}
{'asgd','lbfgs'}
{'dual','lbfgs'} if Learner is 'svm'
Although you can set other combinations, they often lead to solutions with poor accuracy.
If the predictor data is moderate- to low-dimensional and Regularization is 'ridge', set Solver to 'bfgs'.
If Regularization is 'lasso', set Solver to any of these combinations:
'sgd'
'asgd'
'sparsa'
{'sgd','sparsa'}
{'asgd','sparsa'}
Example: 'Solver',{'sgd','lbfgs'}
Beta — Initial linear coefficient estimates
zeros(p,1) (default) | numeric vector | numeric matrix
Initial linear coefficient estimates (β), specified as the comma-separated pair consisting of 'Beta' and a p-dimensional numeric vector or a p-by-L numeric matrix. p is the number of predictor variables in X and L is the number of regularization-strength values (for more details, see Lambda).
If you specify a p-dimensional vector, then the software optimizes the objective function L times using this process.
1. The software optimizes using Beta as the initial value and the minimum value of Lambda as the regularization strength.
2. The software optimizes again using the resulting estimate from the previous optimization as a warm start, and the next smallest value in Lambda as the regularization strength.
3. The software implements step 2 until it exhausts all values in Lambda.
If you specify a p-by-L matrix, then the software optimizes the objective function L times. At iteration j, the software uses Beta(:,j) as the initial value and, after it sorts Lambda in ascending order, uses Lambda(j) as the regularization strength.
If you set 'Solver','dual', then the software ignores Beta.
Data Types: single | double
Bias — Initial intercept estimate
Initial intercept estimate (b), specified as the comma-separated pair consisting of 'Bias' and a numeric scalar or an L-dimensional numeric vector. L is the number of regularization-strength values (for more details, see Lambda).
If you specify a scalar, then the software optimizes the objective function L times using this process.
1. The software optimizes using Bias as the initial value and the minimum value of Lambda as the regularization strength.
2. The software uses the resulting estimate as a warm start to the next optimization iteration, and uses the next smallest value in Lambda as the regularization strength.
3. The software implements step 2 until it exhausts all values in Lambda.
If you specify an L-dimensional vector, then the software optimizes the objective function L times. At iteration j, the software uses Bias(j) as the initial value and, after it sorts Lambda in ascending order, uses Lambda(j) as the regularization strength.
By default:
If Learner is 'logistic', then let g_j be 1 if Y(j) is the positive class, and -1 otherwise. Bias is the weighted average of the g for training or, for cross-validation, in-fold observations.
If Learner is 'svm', then Bias is 0.
Data Types: single | double
FitBias — Linear model intercept inclusion flag
true (default) | false
Linear model intercept inclusion flag, specified as the comma-separated pair consisting of 'FitBias' and true or false.
Value | Description
true | The software includes the bias term b in the linear model, and then estimates it.
false | The software sets b = 0 during estimation.
Example: 'FitBias',false
Data Types: logical
PostFitBias — Flag to fit linear model intercept after optimization
false (default) | true
Flag to fit the linear model intercept after optimization, specified as the comma-separated pair consisting of 'PostFitBias' and true or false.
Value | Description
false | The software estimates the bias term b and the coefficients β during optimization.
true | The software fits the bias term b after optimization.
If you specify true, then FitBias must be true.
Example: 'PostFitBias',true
Data Types: logical
Verbose — Verbosity level
0 (default) | 1
Verbosity level, specified as the comma-separated pair consisting of 'Verbose' and either 0 or 1. Verbose controls the display of diagnostic information at the command line.
Value | Description
0 | templateLinear does not display diagnostic information.
1 | templateLinear periodically displays the value of the objective function, gradient magnitude, and other diagnostic information.
Example: 'Verbose',1
Data Types: single | double
BatchSize — Mini-batch size
Mini-batch size, specified as the comma-separated pair consisting of 'BatchSize' and a positive integer. At each iteration, the software estimates the gradient using BatchSize observations from the training data.
If the predictor data is a numeric matrix, then the default value is 10.
If the predictor data is a sparse matrix, then the default value is max([10,ceil(sqrt(ff))]), where ff = numel(X)/nnz(X), that is, the fullness factor of X.
Example: 'BatchSize',100
Data Types: single | double
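As a worked example of the sparse-matrix default, the following base-MATLAB sketch applies the stated formula to a small, made-up sparse matrix. The matrix dimensions and fill pattern are assumptions chosen so the arithmetic is easy to follow.

```matlab
% Illustrative 1000-by-1000 sparse matrix with 100 nonzero entries.
X = sparse(1:100, 1:100, 1, 1000, 1000);

ff = numel(X)/nnz(X);                      % 1e6 / 100 = 1e4
batchSize = max([10, ceil(sqrt(ff))]);     % max of 10 and sqrt(1e4) = 100
```

The sparser the matrix, the larger ff becomes, so the default mini-batch grows to compensate for the small amount of information carried by each observation.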
LearnRate — Learning rate
Learning rate, specified as the comma-separated pair consisting of 'LearnRate' and a positive scalar. LearnRate controls the optimization step size by scaling the subgradient.
If Regularization is 'ridge', then LearnRate specifies the initial learning rate γ_0. templateLinear determines the learning rate for iteration t, γ_t, using
$${\gamma}_{t}=\frac{{\gamma}_{0}}{{\left(1+\lambda {\gamma}_{0}t\right)}^{c}}.$$
If Regularization is 'lasso', then, for all iterations, LearnRate is constant.
By default, LearnRate is 1/sqrt(1+max((sum(X.^2,obsDim)))), where obsDim is 1 if the observations compose the columns of the predictor data X, and 2 otherwise.
Example: 'LearnRate',0.01
Data Types: single | double
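The default learning rate and the ridge decay schedule can be sketched numerically as follows. The data, lambda, and the iteration range are illustrative. The exponent c is not defined in this section, so c = 1 below is an assumption made only so the sketch runs.

```matlab
X      = [3 0; 0 4; 1 2];   % illustrative data: rows are observations
obsDim = 2;                 % observations in rows, so sum across columns

% Default initial rate: 1/sqrt(1 + largest squared observation norm).
gamma0 = 1/sqrt(1 + max(sum(X.^2, obsDim)));   % 1/sqrt(1 + 16)

lambda = 0.1;               % illustrative regularization strength
c      = 1;                 % ASSUMPTION: exponent value chosen for illustration
t      = 0:4;               % illustrative iteration indices

% Decaying schedule for ridge: gamma_t = gamma0/(1 + lambda*gamma0*t)^c.
gammaT = gamma0 ./ (1 + lambda*gamma0*t).^c;
```

The schedule starts at gamma0 and decreases monotonically, so later iterations take smaller steps.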
OptimizeLearnRate — Flag to decrease learning rate
true (default) | false
Flag to decrease the learning rate when the software detects divergence (that is, overstepping the minimum), specified as the comma-separated pair consisting of 'OptimizeLearnRate' and true or false.
If OptimizeLearnRate is true, then:
1. For the first few optimization iterations, the software starts optimization using LearnRate as the learning rate.
2. If the value of the objective function increases, then the software restarts and uses half of the current value of the learning rate.
3. The software iterates step 2 until the objective function decreases.
Example: 'OptimizeLearnRate',true
Data Types: logical
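The halving idea can be illustrated on a toy problem. The sketch below is not the software's procedure: it uses a hypothetical one-dimensional objective f(w) = w^2, a single gradient step as the trial, and a deliberately oversized starting rate, all chosen for illustration.

```matlab
f     = @(w) w.^2;     % toy objective (hypothetical)
gradF = @(w) 2*w;      % its gradient

w    = 1;              % illustrative starting point
rate = 4;              % deliberately too large, so the first step overshoots

% Halve the rate until one gradient step actually lowers the objective.
while f(w - rate*gradF(w)) >= f(w)
    rate = rate/2;
end
```

Starting from 4, the rates 4, 2, and 1 all fail to reduce the objective, and the loop stops at 0.5, where the step lands on the minimizer w = 0.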
TruncationPeriod — Number of mini-batches between lasso truncation runs
10 (default) | positive integer
Number of mini-batches between lasso truncation runs, specified as the comma-separated pair consisting of 'TruncationPeriod' and a positive integer.
After a truncation run, the software applies a soft threshold to the linear coefficients. That is, after processing k = TruncationPeriod mini-batches, the software truncates the estimated coefficient j using
$${\widehat{\beta}}_{j}^{\ast}=\begin{cases}{\widehat{\beta}}_{j}-{u}_{t} & \text{if } {\widehat{\beta}}_{j}>{u}_{t},\\ 0 & \text{if } \left|{\widehat{\beta}}_{j}\right|\le {u}_{t},\\ {\widehat{\beta}}_{j}+{u}_{t} & \text{if } {\widehat{\beta}}_{j}<-{u}_{t}.\end{cases}$$
For SGD, $${\widehat{\beta}}_{j}$$ is the estimate of coefficient j after processing k mini-batches. $${u}_{t}=k{\gamma}_{t}\lambda .$$ γ_t is the learning rate at iteration t. λ is the value of Lambda.
For ASGD, $${\widehat{\beta}}_{j}$$ is the averaged estimate of coefficient j after processing k mini-batches, and $${u}_{t}=k\lambda .$$
If Regularization is 'ridge', then the software ignores TruncationPeriod.
Example: 'TruncationPeriod',100
Data Types: single | double
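The soft threshold applied at a truncation run is the standard shrink-toward-zero operator, which can be written compactly in base MATLAB. In this sketch the helper name softThreshold and all numeric values are hypothetical, and the SGD form of the threshold (k times the learning rate times lambda) is used.

```matlab
% Shrink each coefficient toward zero by u; anything within [-u, u] becomes 0.
softThreshold = @(betaHat, u) sign(betaHat).*max(abs(betaHat) - u, 0);

k      = 10;      % TruncationPeriod (illustrative)
gammaT = 0.01;    % learning rate at this iteration (illustrative)
lambda = 1;       % regularization strength (illustrative)
u      = k*gammaT*lambda;             % threshold for SGD, about 0.1

betaHat   = [0.5; -0.05; -0.3; 0.1];  % illustrative coefficient estimates
truncated = softThreshold(betaHat, u);
% Approximately [0.4; 0; -0.2; 0]: large entries shrink, small ones vanish.
```

This is why lasso with SGD produces exact zeros: coefficients that stay within the threshold band are set to zero at each truncation run rather than merely made small.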
SGD and ASGD Convergence Controls
BatchLimit — Maximal number of batches
Maximal number of batches to process, specified as the comma-separated pair consisting of 'BatchLimit' and a positive integer. When the software processes BatchLimit batches, it terminates optimization.
If you specify 'BatchLimit' and 'PassLimit', then the software chooses the argument that results in processing the fewest observations.
If you specify 'BatchLimit' but not 'PassLimit', then the software processes enough batches to complete up to one entire pass through the data.
Example: 'BatchLimit',100
Data Types: single | double
BetaTolerance — Relative tolerance on linear coefficients and bias term
1e-4 (default) | nonnegative scalar
Relative tolerance on the linear coefficients and the bias term (intercept), specified as the comma-separated pair consisting of 'BetaTolerance' and a nonnegative scalar.
Let $${B}_{t}=\left[{\beta}_{t}{}^{\prime}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{b}_{t}\right]$$, that is, the vector of the coefficients and the bias term at optimization iteration t. If $${\Vert \frac{{B}_{t}-{B}_{t-1}}{{B}_{t}}\Vert}_{2}<\text{BetaTolerance}$$, then optimization terminates.
If the software converges for the last solver specified in Solver, then optimization terminates. Otherwise, the software uses the next solver specified in Solver.
Example: 'BetaTolerance',1e-6
Data Types: single | double
NumCheckConvergence — Number of batches to process before next convergence check
Number of batches to process before next convergence check, specified as the comma-separated pair consisting of 'NumCheckConvergence' and a positive integer.
To specify the batch size, see BatchSize.
The software checks for convergence about 10 times per pass through the entire data set by default.
Example: 'NumCheckConvergence',100
Data Types: single | double
PassLimit — Maximal number of passes
1 (default) | positive integer
Maximal number of passes through the data, specified as the comma-separated pair consisting of 'PassLimit' and a positive integer.
The software processes all observations when it completes one pass through the data.
When the software passes through the data PassLimit times, it terminates optimization.
If you specify 'BatchLimit' and 'PassLimit', then the software chooses the argument that results in processing the fewest observations.
Example: 'PassLimit',5
Data Types: single | double
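When both 'BatchLimit' and 'PassLimit' are given, the rule that the limit processing the fewest observations wins comes down to a small piece of arithmetic. The sketch below works it through with made-up values. It illustrates the rule as stated on this page, not the software's internals.

```matlab
n          = 1000;   % number of training observations (illustrative)
batchSize  = 10;     % BatchSize (illustrative)
batchLimit = 50;     % BatchLimit (illustrative)
passLimit  = 1;      % PassLimit (illustrative)

obsByBatchLimit = batchLimit*batchSize;   % 50 batches x 10 = 500 observations
obsByPassLimit  = passLimit*n;            % 1 pass x 1000 = 1000 observations

% The effective stopping point is whichever limit is reached first.
obsProcessed = min(obsByBatchLimit, obsByPassLimit);
```

Here BatchLimit is the binding limit, so optimization would stop halfway through the first pass.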
Dual SGD Convergence Controls
BetaTolerance — Relative tolerance on linear coefficients and bias term
1e-4 (default) | nonnegative scalar
Relative tolerance on the linear coefficients and the bias term (intercept), specified as the comma-separated pair consisting of 'BetaTolerance' and a nonnegative scalar.
Let $${B}_{t}=\left[{\beta}_{t}{}^{\prime}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{b}_{t}\right]$$, that is, the vector of the coefficients and the bias term at optimization iteration t. If $${\Vert \frac{{B}_{t}-{B}_{t-1}}{{B}_{t}}\Vert}_{2}<\text{BetaTolerance}$$, then optimization terminates.
If you also specify DeltaGradientTolerance, then optimization terminates when the software satisfies either stopping criterion.
If the software converges for the last solver specified in Solver, then optimization terminates. Otherwise, the software uses the next solver specified in Solver.
Example: 'BetaTolerance',1e-6
Data Types: single | double
DeltaGradientTolerance — Gradient-difference tolerance
1 (default) | nonnegative scalar
Gradient-difference tolerance between upper and lower pool Karush-Kuhn-Tucker (KKT) complementarity conditions violators, specified as the comma-separated pair consisting of 'DeltaGradientTolerance' and a nonnegative scalar.
If the magnitude of the KKT violators is less than DeltaGradientTolerance, then the software terminates optimization.
If the software converges for the last solver specified in Solver, then optimization terminates. Otherwise, the software uses the next solver specified in Solver.
Example: 'DeltaGradientTolerance',1e-2
Data Types: double | single
NumCheckConvergence — Number of passes through entire data set to process before next convergence check
5 (default) | positive integer
Number of passes through entire data set to process before next convergence check, specified as the comma-separated pair consisting of 'NumCheckConvergence' and a positive integer.
Example: 'NumCheckConvergence',100
Data Types: single | double
PassLimit — Maximal number of passes
10 (default) | positive integer
Maximal number of passes through the data, specified as the comma-separated pair consisting of 'PassLimit' and a positive integer.
When the software completes one pass through the data, it has processed all observations.
When the software passes through the data PassLimit times, it terminates optimization.
Example: 'PassLimit',5
Data Types: single | double
BFGS, LBFGS, and SpaRSA Convergence Controls
BetaTolerance — Relative tolerance on linear coefficients and bias term
1e-4 (default) | nonnegative scalar
Relative tolerance on the linear coefficients and the bias term (intercept), specified as the comma-separated pair consisting of 'BetaTolerance' and a nonnegative scalar.
Let $${B}_{t}=\left[{\beta}_{t}{}^{\prime}\text{\hspace{0.17em}}\text{\hspace{0.17em}}{b}_{t}\right]$$, that is, the vector of the coefficients and the bias term at optimization iteration t. If $${\Vert \frac{{B}_{t}-{B}_{t-1}}{{B}_{t}}\Vert}_{2}<\text{BetaTolerance}$$, then optimization terminates.
If you also specify GradientTolerance, then optimization terminates when the software satisfies either stopping criterion.
If the software converges for the last solver specified in Solver, then optimization terminates. Otherwise, the software uses the next solver specified in Solver.
Example: 'BetaTolerance',1e-6
Data Types: single | double
GradientTolerance — Absolute gradient tolerance
1e-6 (default) | nonnegative scalar
Absolute gradient tolerance, specified as the comma-separated pair consisting of 'GradientTolerance' and a nonnegative scalar.
Let $$\nabla {\mathcal{L}}_{t}$$ be the gradient vector of the objective function with respect to the coefficients and bias term at optimization iteration t. If $${\Vert \nabla {\mathcal{L}}_{t}\Vert}_{\infty}=\mathrm{max}\left|\nabla {\mathcal{L}}_{t}\right|<\text{GradientTolerance}$$, then optimization terminates.
If you also specify BetaTolerance, then optimization terminates when the software satisfies either stopping criterion.
If the software converges for the last solver specified in Solver, then optimization terminates. Otherwise, the software uses the next solver specified in Solver.
Example: 'GradientTolerance',1e-5
Data Types: single | double
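The two stopping criteria can be checked by hand on a pair of consecutive iterates. The following base-MATLAB sketch uses made-up iterates and a made-up gradient. It reads the fraction in the BetaTolerance formula as element-wise division, which is an interpretation on our part.

```matlab
betaTolerance     = 1e-4;   % default value of BetaTolerance
gradientTolerance = 1e-6;   % default value of GradientTolerance

% Illustrative iterates: coefficients stacked with the bias term.
Bprev = [0.5; -0.2; 0.1];
Bcurr = Bprev + 1e-6*[1; -1; 1];

% ASSUMPTION: (B_t - B_{t-1})/B_t is taken element-wise before the 2-norm.
relChange     = norm((Bcurr - Bprev)./Bcurr, 2);   % about 1.1e-5
betaConverged = relChange < betaTolerance;

% Illustrative gradient; the infinity norm is the largest absolute entry.
gradL         = [2e-7; -5e-7; 1e-7];
gradConverged = norm(gradL, Inf) < gradientTolerance;   % 5e-7 < 1e-6

% Optimization stops as soon as either criterion holds.
stopNow = betaConverged || gradConverged;
```

In this example both criteria are met, but either one alone would be enough to terminate.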
HessianHistorySize — Size of history buffer for Hessian approximation
15 (default) | positive integer
Size of history buffer for Hessian approximation, specified as the comma-separated pair consisting of 'HessianHistorySize' and a positive integer. That is, at each iteration, the software composes the Hessian using statistics from the latest HessianHistorySize iterations.
The software does not support 'HessianHistorySize' for SpaRSA.
Example: 'HessianHistorySize',10
Data Types: single | double
IterationLimit — Maximal number of optimization iterations
1000 (default) | positive integer
Maximal number of optimization iterations, specified as the comma-separated pair consisting of 'IterationLimit' and a positive integer. IterationLimit applies to these values of Solver: 'bfgs', 'lbfgs', and 'sparsa'.
Example: 'IterationLimit',500
Data Types: single | double
t — Linear classification model learner template
Linear classification model learner template, returned as a template object. To train a linear classification model using high-dimensional data for multiclass problems, pass t to fitcecoc.
If you display t in the Command Window, then all unspecified options appear empty ([]). However, the software replaces empty options with their corresponding default values during training.
A warm start is initial estimates of the beta coefficients and bias term supplied to an optimization routine for quicker convergence.
It is a best practice to orient your predictor matrix so that observations correspond to columns and to specify 'ObservationsIn','columns'. As a result, you can experience a significant reduction in optimization-execution time.
If the predictor data has few observations, but many predictor variables, then:
Specify 'PostFitBias',true.
For SGD or ASGD solvers, set PassLimit to a positive integer that is greater than 1, for example, 5 or 10. This setting often results in better accuracy.
For SGD and ASGD solvers, BatchSize affects the rate of convergence.
If BatchSize is too small, then the software achieves the minimum in many iterations, but computes the gradient per iteration quickly.
If BatchSize is too large, then the software achieves the minimum in fewer iterations, but computes the gradient per iteration slowly.
Large learning rates (see LearnRate) speed up convergence to the minimum, but can lead to divergence (that is, overstepping the minimum). Small learning rates ensure convergence to the minimum, but can lead to slow termination.
If Regularization is 'lasso', then experiment with various values of TruncationPeriod. For example, set TruncationPeriod to 1, 10, and then 100.
For efficiency, the software does not standardize predictor data. To standardize the predictor data (X), enter
X = bsxfun(@rdivide,bsxfun(@minus,X,mean(X,2)),std(X,0,2));
The code requires that you orient the predictors and observations as the rows and columns of X, respectively. Also, for memory-usage economy, the code replaces the original predictor data with the standardized data.
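To see what the standardization line does, the sketch below applies it to a tiny made-up matrix with predictors in rows and observations in columns, then checks the result. It writes to a new variable Xs rather than overwriting X, purely so the check is easy to read. The implicit-expansion form shown last is an equivalent alternative available in MATLAB R2016b and later.

```matlab
% Illustrative data: 2 predictors (rows) by 4 observations (columns).
X = [1 2 3 4; 10 20 30 40];

% Same expression as above: subtract each row's mean, divide by its std.
Xs = bsxfun(@rdivide, bsxfun(@minus, X, mean(X,2)), std(X,0,2));

rowMeans = mean(Xs, 2);     % each predictor now has mean close to 0
rowStds  = std(Xs, 0, 2);   % and standard deviation close to 1

% Equivalent form using implicit expansion (R2016b or later).
XsAlt = (X - mean(X,2))./std(X,0,2);
```

Both predictors end up with identical standardized rows here because the second row is just ten times the first.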
[1] Hsieh, C. J., K. W. Chang, C. J. Lin, S. S. Keerthi, and S. Sundararajan. “A Dual Coordinate Descent Method for Large-Scale Linear SVM.” Proceedings of the 25th International Conference on Machine Learning, ICML ’08, 2008, pp. 408–415.
[2] Langford, J., L. Li, and T. Zhang. “Sparse Online Learning Via Truncated Gradient.” J. Mach. Learn. Res., Vol. 10, 2009, pp. 777–801.
[3] Nocedal, J. and S. J. Wright. Numerical Optimization, 2nd ed., New York: Springer, 2006.
[4] Shalev-Shwartz, S., Y. Singer, and N. Srebro. “Pegasos: Primal Estimated Sub-Gradient Solver for SVM.” Proceedings of the 24th International Conference on Machine Learning, ICML ’07, 2007, pp. 807–814.
[5] Wright, S. J., R. D. Nowak, and M. A. T. Figueiredo. “Sparse Reconstruction by Separable Approximation.” Trans. Sig. Proc., Vol. 57, No. 7, 2009, pp. 2479–2493.
[6] Xiao, Lin. “Dual Averaging Methods for Regularized Stochastic Learning and Online Optimization.” J. Mach. Learn. Res., Vol. 11, 2010, pp. 2543–2596.
[7] Xu, Wei. “Towards Optimal One Pass Large Scale Learning with Averaged Stochastic Gradient Descent.” CoRR, abs/1107.2490, 2011.
Usage notes and limitations when you train a model by passing a linear model template and tall arrays to fitcecoc:
The default values for these name-value pair arguments are different when you work with tall arrays.
'Lambda' — Can be 'auto' (default) or a scalar
'Regularization' — Supports only 'ridge'
'Solver' — Supports only 'lbfgs'
'FitBias' — Supports only true
'Verbose' — Default value is 1
'BetaTolerance' — Default value is relaxed to 1e-3
'GradientTolerance' — Default value is relaxed to 1e-3
'IterationLimit' — Default value is relaxed to 20
When fitcecoc uses a templateLinear object with tall arrays, the only available solver is LBFGS. The software implements LBFGS by distributing the calculation of the loss and gradient among different parts of the tall array at each iteration. If you do not specify initial values for Beta and Bias, the software refines the initial estimates of the parameters by fitting the model locally to parts of the data and combining the coefficients by averaging.
For more information, see Tall Arrays.