How to find strings in a very large array of data?
Hi
I have a CSV file containing a large number of numbers and a few random strings like 'zgdf'. I need to find them and set them to zero. I cannot use 'csvread' (because of the strings), so I use 'textscan' to read the file.
I then convert the data to numbers using str2double. MATLAB turns the string values into NaN, which is fine for me, but it takes a long time, especially because this has to be done for many similar files.
Any faster method to sort this out?
This is how I read the data (the original file has two columns and a large number of rows):
fileID = fopen(filename);
C = textscan(fileID,'%s %s','Delimiter',',');
fclose(fileID);
for i = 1:length(C{1})
D(i) = str2double(C{1}{i});
end
Thanks
10 Comments
Jan
on 20 Nov 2019
Please explain how long it takes and which speed-up you need. Maybe the process is limited by the disk and MATLAB is not the bottleneck? If you post your current code, a matching solution is more likely. A short and meaningful example of the inputs would also be useful. Is the number of columns known in advance?
Steven
on 20 Nov 2019
B in your code appears to be a cell array of cell arrays. Is that the output of textscan()? Which formatSpec are you using in your call to textscan()?
As Jan mentioned, a simple, abbreviated example of B and B{1} would answer many questions.
See the update in my answer. If my answer and the updated suggestion do not work, please provide more detail about B.
Steven
on 20 Nov 2019
Adam Danz
on 20 Nov 2019
Both methods in my answer should work. Have you tried them?
Ridwan Alam
on 21 Nov 2019
If you have a known and fixed set of noise, say {"zgdf", "cvbn"}, you could have used the "TreatAsEmpty" option with textscan(). But I believe that's not the case. Sigh!
Knowing your MATLAB release is usually helpful, which is why it's included as an optional field when you post a question in this forum.
I've confirmed that the str2double() loop is indeed faster than applying str2double() directly to the whole cell array. Sometimes loops are faster.
See method 3 in my answer which applies your sscanf idea and avoids the error you described.
See method 4 for a FEX function that is like str2double() but much faster.
Method 5 is very fast but requires r2019a.
Lastly, whenever you build a variable within a loop, always pre-allocate it. Growing the variable inside the loop will definitely slow down your code.
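The pre-allocation point can be sketched against the loop from the question (a minimal sketch, assuming C is the cell array returned by the textscan call in the original post):

```matlab
% Pre-allocate D once instead of growing it on every iteration.
n = numel(C{1});
D = zeros(n, 1);                   % allocate up front
for i = 1:n
    D(i) = str2double(C{1}{i});    % non-numeric strings such as 'zgdf' become NaN
end
D(isnan(D)) = 0;                   % set the non-numeric entries to zero
```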
Ridwan Alam
on 21 Nov 2019
Edited: Ridwan Alam
on 21 Nov 2019
@Steven
I have updated my answer with the textscan syntax for the 'TreatAsEmpty' option. It returns NaN in place of those known noisy strings. Adding the 'EmptyValue',0 option returns 0 instead of NaN.
Not sure how much speed-up that will give, though :(
Accepted Answer
More Answers (2)
Ridwan Alam
on 20 Nov 2019
Edited: Ridwan Alam
on 21 Nov 2019
Given the list of noise {'a', 'b', 'ee'}:
C = cell2mat(textscan(fileID,'%f %f','Delimiter',',','TreatAsEmpty',{'a','b','ee'},'EmptyValue',0));
Try this!!
%% Old Answer
Updated using Method 1 from Adam:
C = textscan(fileID,'%s %s','Delimiter',',');
C = [str2double(C{1}) str2double(C{2})];
C(isnan(C)) = 0;
9 Comments
Ridwan Alam
on 20 Nov 2019
Edited: Ridwan Alam
on 20 Nov 2019
Thanks. As Steven mentioned, "I need to find them and set them to zero", so I was under the impression that running a loop to find the NaNs was what was taking the time.
Adam Danz
on 20 Nov 2019
Your lines of code will definitely solve that part of the problem! :)
Ridwan Alam
on 20 Nov 2019
I assumed the rest to be trivial ;-)
Hmmmm, I tested the TreatAsEmpty idea using the attached file and didn't get the expected results.
Ridwan Alam
on 21 Nov 2019
I got this:
C = cell2mat(textscan(fileID,'%f %f','Delimiter',',','TreatAsEmpty',{'sdfs','1 sec'},'EmptyValue',0));
% C =
%
% 1 0
% 2 0
% 3 2
% 0 3
% 3 0
% 0 3
% 3 0
% 3 3
% 3 3
% 3 3
% 3 3
% 3 3
% 0 3
% 3 0
% 3 3
% 3 3
% 3 3
% 3 3
% 3 3
% 3 3
Adam Danz
on 21 Nov 2019
Right, if you know the strings in the file ahead of time you can list them in the TreatAsEmpty value. I assume the strings are not known prior to reading in the file.
Steven
on 21 Nov 2019
Ridwan Alam
on 21 Nov 2019
Sure, Steven. Please vote up if you liked the conversation. Thanks!
per isakson
on 21 Nov 2019
Edited: per isakson
on 23 Nov 2019
"random strings like 'zgdf'" — if that means letters of the US alphabet only, this code is rather fast.
%%
chr = fileread('cssm.txt');
chr = regexprep( chr, '[A-Za-z]+', '0.0' );
cac = textscan( chr, '%f%f', 'Delimiter',',', 'CollectOutput',true );
num = cac{1};
The result:
>> num(1:10,:)
ans =
0.81472 0.15761
0 0.97059
0.12699 0.95717
0.91338 0.48538
0.63236 0.80028
0.09754 0.14189
0.2785 0
0.54688 0.91574
0 0.79221
0.96489 0.95949
Where cssm.txt contains
0.81472, 0.15761
abc , 0.97059
0.12699, 0.95717
0.91338, 0.48538
0.63236, 0.80028
0.09754, 0.14189
0.27850, def
0.54688, 0.91574
zgdf , 0.79221
0.96489, 0.95949
et cetera
In response to comments
See the caveat in the first line of my answer.
I failed to find a regular expression for "not a legal number", and if one exists it might not be that fast.
It's straightforward to add a few characters (many becomes impractical), e.g. 'â' and '^', and to make sure that the string is followed by a comma or end of line.
>> chr = regexprep( '12.3, abc, g^â, 1.0e5, def ', '(?m)[A-Za-zâ^]+(?=\x20*\r?(,|$))', '0.0' )
chr =
'12.3, 0.0, 0.0, 1.0e5, 0.0 '
>>
Look-ahead, e.g. '(?=\x20*\r?(,|$))', is reasonably fast, but look-behind sometimes ruins the performance.
The above regex fails for 'def1', '1deg' and '10a'.
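The mixed-token failure can be reproduced directly with the same regex (a sketch; the replacement string '0.0' is the one used above):

```matlab
% In 'def1' the letters are followed by a digit, so the look-ahead never
% fires and the token survives untouched; in '1deg' and '10a' only the
% trailing letters match, which silently produces the wrong numbers
% 10.0 and 100.0.
chr = regexprep( '12.3, def1, 1deg, 10a', ...
                 '(?m)[A-Za-zâ^]+(?=\x20*\r?(,|$))', '0.0' )
% chr = '12.3, def1, 10.0, 100.0'
```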
fileread in combination with CRLF as the newline sequence poses a problem when using regular expressions: the anchor $ doesn't recognise CRLF as a newline. (Please tell me if I missed something.) The best way to avoid this is to replace fileread by a function that uses
[fid, msg] = fopen( filespec, 'rt' );  % text mode translates CRLF to LF
chr = fread( fid, inf, '*char' ).';    % transpose to a row vector, as fileread returns
fclose( fid );
5 Comments
Ridwan Alam
on 21 Nov 2019
Cool!!
Jan
on 21 Nov 2019
What about 1.0e5 ?
per isakson
on 22 Nov 2019
Edited: per isakson
on 22 Nov 2019
I added a response to my answer.