How to correctly read in variable names using datastore and tall functions
22 views (last 30 days)
Show older comments
I'm trying to read in data from a number of different sources, do some work to collate and then write to appropriate files for staging before loading as a datastore and then working as a tall timetable type. The issue I have at the moment is that the datastore does not seem to handle different variable names as loaded into the files. For example, if the variables in file 1 are Time, Var_1, Array_1 and Array_2 while file 2 has Time, Var_1, Array_2 and Array_3, I would expect that the datastore should match Time and Var_1 whilst having a column for Array_1,Array_2 and Array_3 with NaN's in the appropriate rows. What happens instead is that the variable names are read only from the first file and subsequent files are assumed to have the same headings. In the event that the first file has more columns than subsequent files, rows from subsequent files are padded with NaN's at the end. If the first file has fewer columns than subsequent files, the subsequent files are truncated and incorrectly assigned variable names based on the first file headers. I have put together a MWE below. I have tried a few different types of files (csv and xls) and the same problem occurs. Any workarounds or fixes would be appreciated, I'm looking at around 25 million rows so prefer to use the tall structures if possible.
As a side note, when using an array as an argument into table, the expansion of the array doesn't occur until it is written to disk which is why I have to edit the xls files afterwards to correct the headings.
%create 2 rows of test data
TimeVector(1)=datetime;
pause(1)
TimeVector(2)=datetime;
Var1=[1;2];
Array1=[3 4 5 6 7 8 9;13 14 15 16 17 18 19];
VarNames={'Var1';...
'Array_3';...
'Array_4';...
'Array_5';...
'Array_6';...
'Array_7';...
'Array_8';...
'Array_9';
};
VarNames2=VarNames(3:end);
%create the timetable
TT = timetable(TimeVector',Var1,Array1); %Cannot add variable names here as there are only two variables, the
% expansion of Array1 has not yet occurred
%convert to table so it can be saved to local disk
TT2=timetable2table(TT); %Expansion of Array has not occurred
%create test subset consisting of only Var1, Array_5, Array_6 etc and
%missing Array_3 and Array_4
TT_sub = timetable(TimeVector',Var1,Array1(:,3:end));
TT2_sub=timetable2table(TT_sub);
%write the timetable data to csv, only works on tables, not timetables
writetable(TT2,'C:\temp\Matlab\test.xls','filetype','spreadsheet');
writetable(TT2_sub,'C:\temp\Matlab\test2.xls','filetype','spreadsheet');
%hack the spreadsheet files with the correct variable names
%This is needed as the array variables have now been expanded and given
%default variable names
xlswrite('C:\temp\Matlab\test.xls',VarNames','Sheet1','B1')
xlswrite('C:\temp\Matlab\test2.xls',VarNames2','Sheet1','B1')
file_list={'C:\temp\Matlab\test.xls';'C:\temp\Matlab\test2.xls'};
%read back using datastore, note that reversing the order of file_list also
%results in incorrect (but different) behaviour.
ds=datastore(file_list);
OutputT=tall(ds)
OutputTT=table2timetable(OutputT);
% not required for example but tall
% timetable is the desired final format
0 Comments
Answers (1)
Edric Ellis
on 5 Dec 2017
The datastore documentation states that variable names are read only from the first file. You can work around this by using fileDatastore, and using a 'ReadFcn' that sorts out the variables for you, like so:
function t = readTableWithVars(fname, varNames)
%readTableWithVars Call READTABLE and emit table with specified variables
t = readtable(fname);
% Get the original variables from the file
origNames = t.Properties.VariableNames;
% Work out which variables are missing
varsToAdd = setdiff(varNames, origNames);
% Add variables - as NaN
t{:, varsToAdd} = NaN;
% Reorder variables and select only those required.
t = t(:, varNames);
end
Then, you need to make a fileDatastore like this:
fds = fileDatastore(file_list, 'ReadFcn', ...
@(fname) readTableWithVars(fname, {'Time', 'Array_3', 'Foo'}), ...
'UniformRead', true)
0 Comments
See Also
Categories
Find more on Parallel for-Loops (parfor) in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!