Reading and processing data from text file to matlab variable quickly

I use the following code to read data from a text file and process it into two cell arrays. It works, but can it be done faster? Although I currently need the cell array format for the downstream code that uses the data, I am also open to considering other data types if they help read the text file more quickly.
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cell2mat(nodes);
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
dpb on 25 Feb 2017
The time overhead is likely not in the file-reading portion but in the regexp processing afterwards; regexp is pretty notorious for not being a speed demon. You're reading the file as just a cellstr array, so I suspect that's not the issue.
Try breaking out the fileread from the surrounding regexp and profile the result; I'll be quite surprised if the above supposition doesn't turn out to be true.
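A minimal sketch of that profiling step, timing the file read and each regexp call separately. The temporary sample file and its 'node -> a,b,c' line layout are assumptions made only so the snippet runs on its own; the real sample_input.txt may look different.

```matlab
% Hypothetical stand-in for sample_input.txt; the line layout is an assumption.
fname = [tempname, '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '0 -> 3,5,9\r\n1 -> 2\r\n2 -> 0,4\r\n');
fclose(fid);

% Time each stage on its own instead of nesting fileread inside regexp.
t = tic; txt = fileread(fname);                              tRead  = toc(t);
t = tic; adjlist = regexp(txt, '\r\n', 'split');
         adjlist(cellfun('isempty', adjlist)) = [];          tSplit = toc(t);
t = tic; nodes = regexp(adjlist, '\w*(?= )', 'match');       tNodes = toc(t);
t = tic; edges = regexp(adjlist, '(?<=( |,))\w*', 'match');  tEdges = toc(t);

fprintf('fileread %.6f s, split %.6f s, nodes %.6f s, edges %.6f s\n', ...
        tRead, tSplit, tNodes, tEdges);
delete(fname);
```

On a file this small the numbers are dominated by noise; the same four timers on the real file should show where the time goes.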
Paolo Binetti on 26 Feb 2017
You are right, the bottleneck is the three regexp calls. I have reworded my question slightly; I hope it is clearer. Or do you suggest recasting the problem purely in terms of regexp?


Accepted Answer

per isakson on 26 Feb 2017
Edited: per isakson on 26 Feb 2017
"Reading and processing data from text file to matlab variable quickly" The short answer is that using textscan to read the file and do most of the parsing is faster, and it gives cleaner code.
It's a bit tricky to measure the speed of reading small files, since the file will be available in the system cache after the first test. However, it's safe to claim that in this case textscan is faster.
Run this
>> [nodes,edges,cac] = cssm();
Elapsed time is 0.054037 seconds.
Elapsed time is 0.009937 seconds.
>> cac(:)
ans =
{3001x1 cell}
{3001x1 cell}
where
function [nodes,edges,cac] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
toc
end
A fairer comparison:
>> [nodes,edges,n2,e2] = cssm();
Elapsed time is 0.047859 seconds.
Elapsed time is 0.014726 seconds.
>> edges{1}
ans =
'3' '5' '9'
>> e2{1}
ans =
'3' '5' '9'
where three lines are added to produce the data in the same format
function [nodes,edges,n2,e2] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
n2 = cac{1}; % new
e2 = regexp( cac{2}, ',', 'split' ); % new
e2 = reshape( e2, 1,[] ); % new
toc
end
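To see what the textscan format specifiers do, the same call can be run on a short in-memory string instead of a file. This is only a sketch: the two-line input and its 'node -> a,b,c' layout are assumptions, not the actual content of sample_input.txt.

```matlab
% Hypothetical two-line input; the line layout is an assumption.
txt = sprintf('0 -> 3,5,9\r\n1 -> 2\r\n');

% %s        node label (first space-delimited field)
% %*s       second field, read and discarded
% %[^\r\n]  rest of the line, up to the line break
cac = textscan(txt, '%s%*s%[^\r\n]', 'Delimiter', ' ');

n2 = cac{1};                         % column cell array of node labels
e2 = regexp(cac{2}, ',', 'split');   % one cell of edge labels per node
e2 = reshape(e2, 1, []);             % row layout, as in the regexp version
```

The only regexp left is the comma split on the already isolated edge lists, which is much less work than running lookahead and lookbehind patterns over every full line.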
dpb on 1 Mar 2017
The final line of strsplit after all the preprocessing is
% Split.
[c, matches] = regexp(str, aDelim, 'split', 'match');
so guess it stands to reason it's going to be slower... :)
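For a single-character delimiter the two calls return the same cell array, so the comparison is like for like. A small check, with an example string chosen here for illustration:

```matlab
s = '3,5,9';                      % illustrative edge list
a = strsplit(s, ',');             % wrapper with argument checking
b = regexp(s, ',', 'split');      % direct call
same = isequal(a, b);             % both are 1-by-3 cell arrays of char
```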
per isakson on 2 Mar 2017
Edited: per isakson on 3 Mar 2017
"more efficient way to store strings of different lengths" I guess there is no one-size-fits-all answer. Some considerations:
  • "efficient" regarding memory use and computational speed may conflict.
  • The number of strings to store
  • The variation in length of the strings as Walter pointed out.
  • Which operations will be done on the set of strings.
  • Whether or not strictly "write-once-read-many"
  • Does the cost of making the program/code count?
  • And more ... .
Regarding character arrays: 'first','second','third' should be stored one string per column, padded with blanks to the longest length, as
fst
ieh
rci
sor
tnd
 d
since MATLAB is column-major. This is tricky to read when debugging.
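A short sketch of how such a column-wise array might be built and read back. The variable names are illustrative only.

```matlab
% char pads each row with blanks to the longest string (3-by-6);
% transposing puts one string in each column (6-by-3).
A = char('first', 'second', 'third').';

% Each column is contiguous in memory; strip the padding when reading back.
second = strtrim(A(:, 2).');
```

Note that strtrim would also remove blanks that belong to the stored string itself, so this layout suits strings without leading or trailing spaces.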
I recently had a problem:
  • a fraction of a million valid Matlab variable names. Most names are short, but some are long. (No, I don't use them in expressions with EVAL.)
  • searches typically return a dozen names
Solution:
  • store all names in one row vector, huge_str, separated by char(31). char(31) is displayed as a space by editors.
  • store the positions of char(31) to avoid repeated use of strfind(huge_str)
  • use STRFIND and REGEXP in searches
My resulting code is fast and memory efficient, but it did require some debugging.
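A minimal sketch of the idea, not the actual code: the sample names, the leading and trailing separators, and the way a hit is mapped back to a name are all assumptions made for illustration.

```matlab
names = {'alpha', 'beta_long_name', 'gamma', 'alphabet'};   % illustrative data
sep   = char(31);

% One long row with a separator before, between, and after the names.
huge_str = [sep, strjoin(names, sep), sep];
pos      = find(huge_str == sep);        % separator positions, computed once

% Substring search over the whole store in a single call.
hits = strfind(huge_str, 'alpha');

% Map each hit to the name containing it: the last separator before the hit.
idx   = unique(arrayfun(@(h) find(pos < h, 1, 'last'), hits));
found = arrayfun(@(k) huge_str(pos(k)+1 : pos(k+1)-1), idx, ...
                 'UniformOutput', false);
```

Because valid MATLAB variable names contain only letters, digits, and underscores, char(31) can never occur inside a name, which is what makes it safe as a separator here.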
Is this an undocumented use of char(31) that might not survive the next MATLAB release? I don't think the use of char(31) is mentioned in the MATLAB documentation.
