Deploy Tall Arrays to a CLOUDERA Spark Enabled Hadoop Cluster
This example shows how to deploy a MATLAB® application containing tall arrays to a CLOUDERA® Spark™ enabled Hadoop® cluster.
Deploying MATLAB applications against a CLOUDERA distribution of Spark requires a special wrapper type that you generate using the mcc command. This wrapper type generates a jar file as well as a shell script that calls spark-submit. The spark-submit script in the Spark bin directory is used to start applications on a cluster. It supports both yarn-client mode and yarn-cluster mode.
The inputs to the application are:
master: URL to the Spark cluster
inputFile: the file containing the input data
outputFile: the file containing the results of the computation
Note
The complete code for this example is in the file meanArrivalDemo.m.
Prerequisites
Install the MATLAB Runtime in the default location on the desktop. This example uses /usr/local/MATLAB/MATLAB_Runtime/v91 as the default location for the MATLAB Runtime. If you don't have MATLAB Runtime, see Download and Install MATLAB Runtime for installation instructions.
Install the MATLAB Runtime on every worker node.
Copy the file airlinesmall.csv from the folder toolbox/matlab/demos of your MATLAB install area into the Hadoop Distributed File System (HDFS™) folder /datasets/airlinemod.
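The copy step can be carried out with the Hadoop file system shell. The following is a minimal sketch, assuming the hadoop command is on your PATH. The MATLAB_ROOT default is a hypothetical install location, not one given in this example; substitute your own MATLAB install area.

```shell
#!/bin/sh
# Assumption: MATLAB_ROOT is your MATLAB install area (hypothetical default below).
MATLAB_ROOT="${MATLAB_ROOT:-/usr/local/MATLAB/R2016b}"
SRC="$MATLAB_ROOT/toolbox/matlab/demos/airlinesmall.csv"
DEST="/datasets/airlinemod"   # HDFS folder used in this example

if command -v hadoop >/dev/null 2>&1 && [ -f "$SRC" ]; then
    hadoop fs -mkdir -p "$DEST"      # create the HDFS folder if it does not exist
    hadoop fs -put "$SRC" "$DEST/"   # upload the data set
    hadoop fs -ls "$DEST"            # confirm the file is in place
else
    echo "Skipping copy: hadoop not on PATH or $SRC not found" >&2
fi
```

The guard lets the script exit cleanly on a machine without a Hadoop client instead of failing partway through.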
Deploy Tall Arrays
At the MATLAB command prompt, use the mcc command to generate a jar file and shell script for the MATLAB application meanArrivalDemo.m.
>> mcc -vCW 'Spark:meanArrivalDemoApp' meanArrivalDemo.m
This action creates a jar file named meanArrivalDemoApp.jar and a shell script named run_meanArrivalDemoApp.sh.
Note
To use the shell script, set up the environment variables HADOOP_PREFIX, HADOOP_CONF_DIR, and SPARK_HOME.
Execute the shell script in either yarn-client mode or yarn-cluster mode. In yarn-client mode, the driver runs on the desktop. In yarn-cluster mode, the driver runs in the Application Master process in the cluster.
The general syntax to execute the shell script is:
./run_meanArrivalDemoApp.sh <runtime install root> [Spark arguments] [Application arguments]
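Before running the script, the three environment variables named in the note above need values. The sketch below shows one way to set and sanity-check them. All three default paths are hypothetical placeholders, not values from this example; replace each with the location on your system.

```shell
#!/bin/sh
# Hypothetical paths: replace each with the actual location on your system.
export HADOOP_PREFIX="${HADOOP_PREFIX:-/usr/lib/hadoop}"
export HADOOP_CONF_DIR="${HADOOP_CONF_DIR:-/etc/hadoop/conf}"
export SPARK_HOME="${SPARK_HOME:-/usr/lib/spark}"

# Warn about any variable that does not point at an existing directory.
for var in HADOOP_PREFIX HADOOP_CONF_DIR SPARK_HOME; do
    eval "dir=\${$var}"
    if [ -d "$dir" ]; then
        echo "$var=$dir"
    else
        echo "warning: $var=$dir is not a directory" >&2
    fi
done
```

Values already present in the environment take precedence over the placeholder defaults, so the snippet is safe to source in a shell that is already configured.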
yarn-client mode
Run the following command from a Linux® terminal:
$ ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    yarn-client \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
    hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
To examine the result, enter the following from the MATLAB command prompt:
>> ds = datastore('hdfs:///user/someuser/meanArrivalResult/*');
>> readall(ds)
yarn-cluster mode
Run the following command from a Linux terminal:
$ ./run_meanArrivalDemoApp.sh \
    /usr/local/MATLAB/MATLAB_Runtime/v91 \
    --deploy-mode cluster --master yarn yarn-cluster \
    hdfs://hadoop01glnxa64:54310/datasets/airlinemod/airlinesmall.csv \
    hdfs://hadoop01glnxa64:54310/user/someuser/meanArrivalResult
In yarn-cluster mode, the driver runs on a worker node in the cluster, so standard output from the MATLAB function is not displayed on your desktop. In addition, files can be saved anywhere. To avoid this, the example uses the write function to explicitly save the results to a particular location in HDFS.
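Because the driver output is not shown on the desktop in yarn-cluster mode, it can help to check the result folder and the aggregated YARN logs from a terminal. This is a minimal sketch, assuming the hadoop and yarn commands are on your PATH and that log aggregation is enabled on the cluster. APP_ID is a hypothetical variable that you set to the application ID of your own run.

```shell
#!/bin/sh
RESULT_DIR="/user/someuser/meanArrivalResult"   # output folder used in this example
APP_ID="${APP_ID:-}"   # hypothetical: the YARN application ID of your run

# List the files that the write function saved to HDFS.
if command -v hadoop >/dev/null 2>&1; then
    hadoop fs -ls "$RESULT_DIR"
fi

# Retrieve the driver's standard output from the aggregated YARN logs.
if command -v yarn >/dev/null 2>&1 && [ -n "$APP_ID" ]; then
    yarn logs -applicationId "$APP_ID"
fi
```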