KeyValueDatastore

Datastore for key-value pair data for use with mapreduce

Description

KeyValueDatastore objects are associated with files containing key-value pair data that are outputs of or inputs to mapreduce. Use the KeyValueDatastore properties to specify how you want to access the data. Use dot notation to view or modify a particular property of a KeyValueDatastore object:

ds = datastore('mapredout.mat');
ds.ReadSize = 20;

You can also set KeyValueDatastore properties with name-value pair arguments when you create the datastore using the datastore function:

ds = datastore('mapredout.mat','ReadSize',20);
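
As a quick check that the two approaches above are equivalent, the following sketch creates the datastore both ways and compares the resulting ReadSize values. It assumes the sample file mapredout.mat is on the MATLAB path, and it passes 'Type','keyvalue' explicitly, which is optional here and only makes the intended datastore type unambiguous. The variable names dsA and dsB are illustrative.

```matlab
% Approach 1: create the datastore, then set the property with dot notation.
dsA = datastore('mapredout.mat');
dsA.ReadSize = 20;

% Approach 2: set the property at creation with a name-value pair.
% 'Type','keyvalue' is optional; it requests a KeyValueDatastore explicitly.
dsB = datastore('mapredout.mat','Type','keyvalue','ReadSize',20);

% Both datastores should end up with the same ReadSize.
sameReadSize = isequal(dsA.ReadSize, dsB.ReadSize);
dsClass = class(dsB);
```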

Creation

Create KeyValueDatastore objects using the datastore function.

Properties


Files

Files included in the datastore, specified as an n-by-1 cell array of character vectors or a string array, where each character vector or string is a full path to a file. These are the files defined by the location argument to the datastore function. The location argument contains full paths to files on a local file system, a network file system, or a supported remote location such as Amazon S3™, Windows Azure® Blob Storage, and HDFS™. For more information, see Work with Remote Data.

The files must be either MAT-files or Sequence files generated by the mapreduce function.

Example: {'C:\dir\data\file1.mat';'C:\dir\data\file2.mat'}

Example: {'s3://bucketname/path_to_files/your_file01.mat';'s3://bucketname/path_to_files/your_file02.mat'}

Data Types: cell | string

FileType

File type, specified as either 'mat' for MAT-files or 'seq' for sequence files. By default, the output of mapreduce running against Hadoop® is a datastore containing sequence files. By default, the output of all other mapreduce operations is a datastore containing MAT-files.

Data Types: char | string

ReadSize

Maximum number of key-value pairs to read in a call to the read or preview functions, specified as a positive integer.
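
A minimal sketch of how this limit shows up in practice, assuming the sample file mapredout.mat is on the MATLAB path. The value 3 is arbitrary, and the final read from a datastore can return fewer pairs than the limit.

```matlab
% Limit each read to at most 3 key-value pairs (3 is an arbitrary choice).
ds = datastore('mapredout.mat','ReadSize',3);

% read returns a table; its height is the number of key-value pairs read.
T = read(ds);
numPairs = height(T);

% The count never exceeds the configured limit.
withinLimit = numPairs <= ds.ReadSize;
```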

AlternateFileSystemRoots

Alternate file system root paths, specified as the comma-separated pair consisting of 'AlternateFileSystemRoots' and a string vector or a cell array. Use 'AlternateFileSystemRoots' when you create a datastore on a local machine but need to access and process the data on another machine (possibly running a different operating system). Also, when you process data using Parallel Computing Toolbox™ and MATLAB® Parallel Server™, and the data is stored on your local machines with a copy available on cloud or cluster machines that run a different platform, you must use 'AlternateFileSystemRoots' to associate the root paths.

  • To associate a set of root paths that are equivalent to one another, specify 'AlternateFileSystemRoots' as a string vector. For example,

    ["Z:\datasets","/mynetwork/datasets"]

  • To associate multiple sets of root paths that are equivalent for the datastore, specify 'AlternateFileSystemRoots' as a cell array containing multiple rows where each row represents a set of equivalent root paths. Specify each row in the cell array as either a string vector or a cell array of character vectors. For example:

    • Specify 'AlternateFileSystemRoots' as a cell array of string vectors.

      {["Z:\datasets", "/mynetwork/datasets"];...
       ["Y:\datasets", "/mynetwork2/datasets","S:\datasets"]}

    • Alternatively, specify 'AlternateFileSystemRoots' as a cell array of cell arrays of character vectors.

      {{'Z:\datasets','/mynetwork/datasets'};...
       {'Y:\datasets', '/mynetwork2/datasets','S:\datasets'}}

The value of 'AlternateFileSystemRoots' must satisfy these conditions:

  • Contains one or more rows, where each row specifies a set of equivalent root paths.

  • Each row specifies multiple root paths and each root path must contain at least two characters.

  • Root paths are unique and are not subfolders of one another.

  • Contains at least one root path entry that points to the location of the files.

For more information, see Set Up Datastore for Processing on Different Machines or Clusters.

Example: ["Z:\datasets","/mynetwork/datasets"]

Data Types: string | cell
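
The following sketch shows one way the conditions above might be satisfied for the sample file. It assumes mapredout.mat is on the MATLAB path and uses the folder that contains it as the local root. The second root, "/mynetwork/datasets", is a hypothetical path standing in for wherever a copy of the data lives on the other machine; replace it with a real location before relying on it.

```matlab
% Folder that actually contains the sample file on this machine.
localRoot = fileparts(which('mapredout.mat'));

% Hypothetical equivalent root on another machine or cluster (assumption).
remoteRoot = "/mynetwork/datasets";

% One row of equivalent roots, given as a string vector. The first entry
% points to the real location of the file, as the conditions require.
roots = [string(localRoot), remoteRoot];

ds = datastore(fullfile(localRoot,'mapredout.mat'), ...
    'AlternateFileSystemRoots',roots);
```

On the local machine the datastore reads from localRoot as usual; the alternate root matters only when the datastore is used on a machine where the files live under the other path.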

Object Functions

hasdata          Determine if data is available to read
numpartitions    Number of datastore partitions
partition        Partition a datastore
preview          Subset of data in datastore
read             Read data in datastore
readall          Read all data in datastore
reset            Reset datastore to initial state
transform        Transform datastore
combine          Combine data from multiple datastores
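
As a rough illustration of how several of these functions fit together, the sketch below previews the data, reads everything at once, and then reads only the first partition. It assumes the sample file mapredout.mat is on the MATLAB path; the variable names are illustrative.

```matlab
ds = datastore('mapredout.mat');

% preview returns a small table of key-value pairs from the start of the data.
P = preview(ds);

% readall returns every key-value pair in the datastore as one table.
A = readall(ds);

% numpartitions suggests a partition count; partition returns a datastore
% covering only the requested part (here, the first of n).
n = numpartitions(ds);
firstPart = partition(ds,n,1);
A1 = readall(firstPart);

% reset returns the original datastore to its initial state.
reset(ds);
stillHasData = hasdata(ds);
```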

Examples


Create a datastore from the sample file, mapredout.mat, which is an output file of the mapreduce function.

ds = datastore('mapredout.mat')
ds = 

  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.mat'
              }
    ReadSize: 1 key-value pairs
    FileType: 'mat'

Set the ReadSize property to 8 so that each call to read reads at most 8 key-value pairs.

ds.ReadSize = 8
ds = 

  KeyValueDatastore with properties:

       Files: {
              ' ...\matlab\toolbox\matlab\demos\mapredout.mat'
              }
    ReadSize: 8 key-value pairs
    FileType: 'mat'

Read 8 key-value pairs at a time using the read function in a while loop. The loop executes until there is no more data available to read and hasdata(ds) returns false.

while hasdata(ds)
    T = read(ds);
end

Show the last set of key-value pairs read.

T
T = 

    Key     Value 
    ____    ______

    'OO'    [3090]
    'TZ'    [ 216]
    'XE'    [2357]
    '9E'    [ 521]
    'YV'    [ 849]
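
The loop in this example overwrites T on every iteration, so only the last batch survives. If the goal is to keep every key-value pair, one possible variation collects each batch and concatenates them afterward. This is a sketch under the same assumption that mapredout.mat is on the MATLAB path; chunks and allPairs are illustrative names.

```matlab
ds = datastore('mapredout.mat','ReadSize',8);

% Collect each batch in a cell array instead of overwriting it.
chunks = {};
while hasdata(ds)
    chunks{end+1} = read(ds); %#ok<SAGROW>
end

% Stack the batches into a single table of key-value pairs.
allPairs = vertcat(chunks{:});

% Rewind the datastore so it can be read again from the start.
reset(ds);
totalPairs = height(allPairs);
```

For data that fits in memory, readall(ds) gives the same table in a single call; the loop form is mainly useful when each batch needs processing as it arrives.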

Limitations

  • KeyValueDatastore does not support sequence files written in R2013b. Rewrite the sequence files using a version of MATLAB between R2014a and R2018a.

Introduced in R2014b