The predict, loss, margin, and edge methods of these classification classes support tall arrays:
The resume method of ClassificationKernel supports tall arrays. The default value for the 'IterationLimit' name-value pair argument is relaxed to 20 when working with tall arrays. resume uses a block-wise strategy. For details, see Algorithms of fitckernel.
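As a sketch of this workflow (X and Y here are hypothetical tall arrays of predictors and class labels, not names from this page):

```matlab
% X and Y are hypothetical tall arrays of predictors and class labels
Mdl = fitckernel(X,Y);                             % initial fit; tall default IterationLimit is 20
UpdatedMdl = resume(Mdl,X,Y,'IterationLimit',50);  % continue training with a higher limit
```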

fitcdiscr
Supported syntaxes are:
Mdl = fitcdiscr(Tbl,Y)
Mdl = fitcdiscr(X,Y)
Mdl = fitcdiscr(___,Name,Value)
[Mdl,FitInfo,HyperparameterOptimizationResults] = fitcdiscr(___,Name,Value) — fitcdiscr returns the additional output arguments FitInfo and HyperparameterOptimizationResults when you specify the 'OptimizeHyperparameters' name-value pair argument.
The FitInfo output argument is an empty structure array currently reserved for possible future use. The HyperparameterOptimizationResults output argument is a BayesianOptimization object or a table of hyperparameters with associated values that describe the cross-validation optimization of hyperparameters. HyperparameterOptimizationResults is nonempty when the 'OptimizeHyperparameters' name-value pair argument is nonempty at the time you create the model. The values in HyperparameterOptimizationResults depend on the value you specify for the 'HyperparameterOptimizationOptions' name-value pair argument when you create the model.
If you specify 'bayesopt' (default), then HyperparameterOptimizationResults is an object of class BayesianOptimization. If you specify 'gridsearch' or 'randomsearch', then HyperparameterOptimizationResults is a table of the hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst).
Supported name-value pair arguments, and any differences, are:
'ClassNames'
'Cost'
'DiscrimType'
'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify
fitcdiscr(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).
'OptimizeHyperparameters' — The only eligible parameter to optimize is 'DiscrimType'. Specifying 'auto' uses 'DiscrimType'.
'PredictorNames'
'Prior'
'ResponseName'
'ScoreTransform'
'Weights'
For tall arrays and tall tables, fitcdiscr returns
a CompactClassificationDiscriminant object, which
contains most of the same properties as a ClassificationDiscriminant object.
The main difference is that the compact object is sensitive to memory
requirements. The compact object does not include properties that
include the data, or that include an array of the same size as the
data. The compact object does not contain these ClassificationDiscriminant properties:
Additionally, the compact object does not support these ClassificationDiscriminant methods:
compact
crossval
cvshrink
resubEdge
resubLoss
resubMargin
resubPredict
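The optimization workflow above can be sketched as follows (X and Y are hypothetical tall arrays; 'Holdout' is the only supported validation for tall optimization):

```matlab
% Optimize the only eligible hyperparameter, 'DiscrimType', with holdout
% validation (X and Y are hypothetical tall arrays)
[Mdl,FitInfo,HyperparameterOptimizationResults] = fitcdiscr(X,Y, ...
    'OptimizeHyperparameters','auto', ...
    'HyperparameterOptimizationOptions',struct('Holdout',0.2));
% Mdl is a CompactClassificationDiscriminant object
```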

fitcecoc 
Supported syntaxes are:
Mdl = fitcecoc(X,Y)
Mdl = fitcecoc(X,Y,Name,Value)
[Mdl,FitInfo,HyperparameterOptimizationResults] = fitcecoc(X,Y,Name,Value) — fitcecoc returns the additional output arguments FitInfo and HyperparameterOptimizationResults when you specify the 'OptimizeHyperparameters' name-value pair argument.
The FitInfo output argument is an empty structure array currently reserved for possible future use. Options related to cross-validation are not supported. The supported name-value pair arguments are:
'ClassNames'
'Cost'
'Coding' — Default value is 'onevsall'.
'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify
fitcecoc(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).
'Learners' — Default value is 'linear'. You can specify 'linear', 'kernel', a templateLinear or templateKernel object, or a cell array of such objects.
'OptimizeHyperparameters' — When you use linear
binary learners, the value of the 'Regularization'
hyperparameter must be 'ridge' .
'Prior'
'Verbose' — Default value is 1 .
'Weights'
This additional name-value pair argument is specific to tall arrays: 'NumConcurrent' — A positive integer scalar specifying the
number of binary learners that are trained concurrently by combining file I/O
operations. The default value for 'NumConcurrent' is
1 , which means fitcecoc trains the
binary learners sequentially. 'NumConcurrent' is most
beneficial when the input arrays cannot fit into the distributed cluster memory.
Otherwise, the input arrays can be cached and speedup is negligible.
If you run your code on Apache Spark™, NumConcurrent is upper bounded by the memory
available for communications. Check the
'spark.executor.memory' and
'spark.driver.memory' properties in your Apache Spark configuration. See parallel.cluster.Hadoop for more details. For more information
on Apache Spark and other execution environments that control where your code
runs, see Extend Tall Arrays with Other Products (MATLAB).
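For instance, concurrent training of binary learners can be sketched like this (X and Y are hypothetical tall arrays, and the Lambda value is illustrative):

```matlab
% Train an ECOC model with linear binary learners, two at a time,
% by combining file I/O operations (X and Y are hypothetical tall arrays)
t = templateLinear('Lambda',1e-4);   % illustrative regularization strength
Mdl = fitcecoc(X,Y,'Learners',t,'Coding','onevsall','NumConcurrent',2);
```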

fitckernel 
Some name-value pair arguments have different defaults compared to the default values for the in-memory fitckernel function. Supported name-value pair arguments, and any differences, are:
'Learner'
'NumExpansionDimensions'
'KernelScale'
'BoxConstraint'
'Lambda'
'BetaTolerance' — Default value is relaxed to 1e-3.
'GradientTolerance' — Default value is relaxed to 1e-5.
'IterationLimit' — Default value is relaxed to 20.
'BlockSize'
'RandomStream'
'HessianHistorySize'
'Verbose' — Default value is
1 .
'ClassNames'
'Cost'
'Prior'
'ScoreTransform'
'Weights' — Value must be a tall array.
'OptimizeHyperparameters'
'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify
fitckernel(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).
If 'KernelScale' is 'auto', then fitckernel uses the random stream controlled by tallrng for subsampling. For reproducibility, you must set a random number seed for both the global stream and the random stream controlled by tallrng.
If 'Lambda' is 'auto', then fitckernel might take an extra pass through the data to calculate the number of observations in X.
fitckernel uses a block-wise strategy. For details, see Algorithms.
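A sketch of a reproducible fit that uses this subsampling (X and Y are hypothetical tall arrays):

```matlab
% Seed both streams so that 'KernelScale','auto' subsampling is reproducible
rng('default')        % global random stream
tallrng('default')    % random stream used by tall array calculations
Mdl = fitckernel(X,Y,'KernelScale','auto');   % X and Y are hypothetical tall arrays
```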

templateKernel 
The default values for these name-value pair arguments are different when you work with tall arrays.
'Verbose' — Default value is 1.
'BetaTolerance' — Default value is relaxed to 1e-3.
'GradientTolerance' — Default value is relaxed to 1e-5.
'IterationLimit' — Default value is relaxed to
20 .
If 'KernelScale' is 'auto', then templateKernel uses the random stream controlled by tallrng for subsampling. For reproducibility, you must set a random number seed for both the global stream and the random stream controlled by tallrng.
If 'Lambda' is 'auto', then templateKernel might take an extra pass through the data to calculate the number of observations.
templateKernel uses a block-wise strategy. For details, see Algorithms.
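A sketch of a kernel learner template passed to fitcecoc for tall data (X and Y are hypothetical tall arrays):

```matlab
% Define a kernel learner template; with tall arrays its BetaTolerance,
% GradientTolerance, and IterationLimit defaults are relaxed as described above
t = templateKernel('KernelScale','auto');
Mdl = fitcecoc(X,Y,'Learners',t);   % X and Y are hypothetical tall arrays
```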

fitclinear
Some name-value pair arguments have different defaults compared to the default values for the in-memory fitclinear function. Supported name-value pair arguments, and any differences, are:
'ObservationsIn' — Supports only 'rows'.
'Lambda' — Can be 'auto'
(default) or a scalar.
'Learner'
'Regularization' — Supports only
'ridge' .
'Solver' — Supports only
'lbfgs' .
'FitBias' — Supports only
true .
'Verbose' — Default value is
1 .
'Beta'
'Bias'
'ClassNames'
'Cost'
'Prior'
'Weights' — Value must be a tall array.
'HessianHistorySize'
'BetaTolerance' — Default value is relaxed to 1e-3.
'GradientTolerance' — Default value is relaxed to 1e-3.
'IterationLimit' — Default value is relaxed to
20 .
'OptimizeHyperparameters' — Value of
'Regularization' parameter must be
'ridge' .
'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify
fitclinear(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)) .
For tall arrays, fitclinear implements
LBFGS by distributing the calculation of the loss and gradient among
different parts of the tall array at each iteration. Other solvers
are not available for tall arrays. When initial values for Beta and Bias are
not given, fitclinear refines the initial estimates
of the parameters by fitting the model locally to parts of the data
and combining the coefficients by averaging.
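A sketch that makes the tall-specific restrictions explicit (X and Y are hypothetical tall arrays):

```matlab
% For tall arrays, only the LBFGS solver and ridge regularization
% are available (X and Y are hypothetical tall arrays)
Mdl = fitclinear(X,Y,'Lambda','auto','Regularization','ridge', ...
    'Solver','lbfgs','Verbose',1);
```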

templateLinear 
Supported name-value pair arguments, and any differences, are:
'Lambda' — Can be 'auto' (default) or a scalar
'Regularization' — Supports only
'ridge'
'Solver' — Supports only
'lbfgs'
'FitBias' — Supports only
true
'Verbose' — Default value is
1
'BetaTolerance' — Default value is relaxed to 1e-3
'GradientTolerance' — Default value is relaxed to 1e-3
'IterationLimit' — Default value is relaxed to
20
When fitcecoc uses a templateLinear
object with tall arrays, the only available solver is LBFGS. The software implements
LBFGS by distributing the calculation of the loss and gradient among different parts
of the tall array at each iteration. If you do not specify initial values for
Beta and Bias , the software refines
the initial estimates of the parameters by fitting the model locally to parts of the
data and combining the coefficients by averaging.

fitcnb  
fitctree 
Supported syntaxes are:
tree = fitctree(Tbl,Y)
tree = fitctree(X,Y)
tree = fitctree(___,Name,Value)
[tree,FitInfo,HyperparameterOptimizationResults] = fitctree(___,Name,Value) — fitctree returns the additional output arguments FitInfo and HyperparameterOptimizationResults when you specify the 'OptimizeHyperparameters' name-value pair argument.
The FitInfo output argument is an empty structure array currently reserved for possible future use. The HyperparameterOptimizationResults output argument is a BayesianOptimization object or a table of hyperparameters with associated values that describe the cross-validation optimization of hyperparameters. HyperparameterOptimizationResults is nonempty when the 'OptimizeHyperparameters' name-value pair argument is nonempty at the time you create the model. The values in HyperparameterOptimizationResults depend on the value you specify for the 'HyperparameterOptimizationOptions' name-value pair argument when you create the model.
If you specify 'bayesopt' (default), then HyperparameterOptimizationResults is an object of class BayesianOptimization. If you specify 'gridsearch' or 'randomsearch', then HyperparameterOptimizationResults is a table of the hyperparameters used, observed objective function values (cross-validation loss), and rank of observations from lowest (best) to highest (worst).
Supported name-value pair arguments, and any differences, are:
'AlgorithmForCategorical'
'CategoricalPredictors'
'ClassNames'
'Cost'
'HyperparameterOptimizationOptions' — For cross-validation, tall optimization supports only 'Holdout' validation. For example, you can specify
fitctree(X,Y,'OptimizeHyperparameters','auto','HyperparameterOptimizationOptions',struct('Holdout',0.2)).
'MaxNumCategories'
'MaxNumSplits' — For tall optimization, fitctree searches among integers, by default log-scaled in the range [1,max(2,min(10000,NumObservations-1))].
'MergeLeaves'
'MinLeafSize'
'MinParentSize'
'NumVariablesToSample'
'OptimizeHyperparameters'
'PredictorNames'
'Prior'
'ResponseName'
'ScoreTransform'
'SplitCriterion'
'Weights'
This additional name-value pair argument is specific to tall arrays: 'MaxDepth' — A positive integer
specifying the maximum depth of the output tree.
Specify a value for this argument to return a tree
that has fewer levels and requires fewer passes
through the tall array to compute. Generally, the
algorithm of fitctree takes
one pass through the data and an additional pass
for each tree level. By default, the function does not set a maximum tree depth.
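A sketch that bounds the number of passes through the data (X and Y are hypothetical tall arrays, and the depth value is illustrative):

```matlab
% Limit the tree to 8 levels; fitctree takes roughly one pass through
% the tall array per tree level, so this bounds the total I/O cost
tree = fitctree(X,Y,'MaxDepth',8);   % X and Y are hypothetical tall arrays
```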

TreeBagger
Supported syntaxes for tall X, Y, or Tbl are:
B = TreeBagger(NumTrees,Tbl,Y)
B = TreeBagger(NumTrees,X,Y)
B = TreeBagger(___,Name,Value)
For tall arrays, TreeBagger supports classification but does not
support regression. Supported name-value pairs are: 'NumPredictorsToSample' —
Default value is the square root of the number of variables for classification.
'MinLeafSize' — Default value is 1 if the number
of observations is less than 50,000. If the number of observations is 50,000 or
greater, then the default value is
max(1,min(5,floor(0.01*NobsChunk))) , where
NobsChunk is the number of observations in a
chunk.
'ChunkSize' (only for tall arrays)
— Default value is 50000 .
In addition, TreeBagger supports these optional arguments of fitctree:
For tall data, TreeBagger returns
a CompactTreeBagger object
that contains most of the same properties as a full TreeBagger object.
The main difference is that the compact object is more memory efficient.
The compact object does not include properties that include the data,
or that include an array of the same size as the data. The number of trees contained in the returned CompactTreeBagger object
can differ from the number of trees specified as input to the
TreeBagger function. TreeBagger determines
the number of trees to return based on factors that include the size of the input data
set and the number of data chunks available to grow trees. Supported CompactTreeBagger methods
are: combine
error
margin
meanMargin
predict
setDefaultYfit
The error, margin, meanMargin, and predict methods do not support the name-value pair arguments 'Trees', 'TreeWeights', or 'UseInstanceForTree'. The meanMargin method additionally does not support 'Weights'.
TreeBagger creates a random forest by generating trees on disjoint chunks of the data. When more data
by generating trees on disjoint chunks of the data. When more data
is available than is required to create the random forest, the data
is subsampled. For a similar example, see Random Forests for Big Data (Genuer,
Poggi, Tuleau-Malot, Villa-Vialaneix 2015).
Depending on how the data is stored, it is possible that some chunks of data contain
observations from only a few classes out of all the classes. In this case,
TreeBagger might produce inferior results compared to the case
where each chunk of data contains observations from most of the classes. During training of the TreeBagger algorithm, the speed, accuracy,
and memory usage depend on a number of factors. These factors include values for
NumTrees , 'ChunkSize' ,
'MinLeafSize', and 'MaxNumSplits'.
For an n-by-p tall array X, TreeBagger implements sampling during training. This sampling depends on these variables:
n is the number of rows in X.
'ChunkSize' is the number of observations per chunk.
NumTrees is the number of trees to grow.
r = n/'ChunkSize' is the approximate number of chunks.
Because the value of n is fixed for a given X, your settings for NumTrees and 'ChunkSize' determine how TreeBagger samples X.
If r > NumTrees, then TreeBagger samples 'ChunkSize' * NumTrees observations from X, and trains one tree per chunk (with each chunk containing 'ChunkSize' observations). This scenario is the most common when you work with tall arrays.
If r ≤ NumTrees, then TreeBagger trains approximately NumTrees/r trees in each chunk, using bootstrapping within the chunk.
If n ≤ 'ChunkSize', then TreeBagger uses bootstrapping to generate samples (each of size n) on which to train individual trees.
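The arithmetic behind these scenarios can be sketched with hypothetical values:

```matlab
% Hypothetical sizes illustrating the sampling rule
n         = 1e6;            % rows in the tall array X
ChunkSize = 50000;          % observations per chunk
NumTrees  = 10;
r = n/ChunkSize;            % r = 20 chunks
% Because r (20) > NumTrees (10), TreeBagger samples
% ChunkSize*NumTrees = 500,000 observations and trains one tree per chunk
```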
When specifying a value for NumTrees , consider the following:
If you run your code on Apache Spark, and your data set is distributed with Hadoop® Distributed File System (HDFS™), start by specifying a value for NumTrees
that is at least twice the number of partitions in HDFS for your data set. This setting prevents excessive data
communication among Apache Spark executors and can improve performance of the
TreeBagger algorithm. TreeBagger copies fitted trees into the client memory
in the resulting CompactTreeBagger model. Therefore, the
amount of memory available to the client creates an upper bound on the value
you can set for NumTrees . You can tune the values of
'MinLeafSize' and 'MaxNumSplits'
for more efficient speed and memory usage at the expense of some predictive
accuracy. After tuning, if the value of NumTrees is less
than twice the number of partitions in HDFS for your data set, then consider repartitioning your data in
HDFS to have larger partitions.
After specifying a value for NumTrees , set
'ChunkSize' to ensure that TreeBagger uses
most of the data to grow trees. Ideally, 'ChunkSize' * NumTrees
should approximate n, the number of rows in your data. Note that the
memory available in the workers for training individual trees can also determine an
upper bound for 'ChunkSize'. You can adjust the Apache Spark memory properties to avoid out-of-memory errors and
support your workflow. See parallel.cluster.Hadoop for more information.
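For example, a sketch of choosing 'ChunkSize' so that most of the data is used to grow trees (NumTrees here is hypothetical, and X and Y are hypothetical tall arrays):

```matlab
NumTrees = 40;              % chosen first, e.g. based on HDFS partition count
n = gather(size(X,1));      % number of rows in the tall array X
chunk = floor(n/NumTrees);  % so that ChunkSize*NumTrees is approximately n
B = TreeBagger(NumTrees,X,Y,'ChunkSize',chunk);
```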
