MATLAB Answers

0

Textscan to read CSV File (Mixed Format)

Asked by abhay kaul on 12 Dec 2014
Latest activity Commented on by per isakson
on 26 Apr 2015
I am trying to read multiple large CSV files (200,000 X 200 each) into MATLAB 2013a 64bit which contains numbers and strings. The order of numbers and strings is not fixed between files. I am using "Textscan" function. If I use the "%f" format then "textscan" stops execution as soon as it encounters the string so instead I ended up using "%s". But, using "str2double" to convert to numbers takes a long time. "str2double" also converts any non-numeric entry in the data to NaNs and I like using it but the overhead is too much.
I am wondering if anyone else has found a work around for this. I would ideally prefer using "textscan" since the code queries user to specify the line at which column names are in the CSV file and also some lines are commented in the CSV file and textscan is ideal for this file type. But any ideas are appreciated.
Thanks

  1 Comment

Did you ever try to read these files with importdata?

Sign in to comment.

2 Answers

Answer by Image Analyst
on 12 Dec 2014

Apparently you need to inspect the line of text first to determine if it's a comment or not. So use fgetl() to read a string from the file. Then call strtrim() to get rid of any leading whitespace. Then check if the first character in non-numeric, a % symbol, or whatever else indecates a comment in your file. Then use sscanf() or textscan() to extract the numbers into a numerical array.

  1 Comment

Just a clarification on my question. The issue is not to identify if the line that has a comment or not. Textscan does that for me. The data I have has some columns that are numeric whereas some columns have all strings. But the order of columns which are numeric or string is not fixed among multiple files. I should have provided that information in my original questions. Thanks.

Sign in to comment.


Answer by Star Strider
on 12 Dec 2014

Please post one of your files or a representative sample of one that demonstrates the problem.
As an interim possibility, see if this approach works:
fidi = fopen('filename.csv');
d1 = textscan(fidi, '%f %f', 'Delimiter','\n', 'HeaderLines',11, 'CollectOutput',1);
fseek(fidi,0,0);
d2 = textscan(fidi, '%f %f', 'Delimiter','\n', 'HeaderLines',11, 'CollectOutput',1);
You may have to put that in a loop and subscript ‘d1’ and ‘d2’ and others appropriately. (This is archive code that worked for the file it was intended to read.) You will have to adapt it to your file structure, including an initial 'HeaderLines' option if necessary.
The code reads starting with the initial string and the numerical data following until it gets to the next string (here assumed to have the same structure) where it stops. The fseek function automatically re-starts textscan reading the file at the correct position at the second textscan call. It then reads that string and the subsequent data until it gets to the end.

  2 Comments

Here is an example file.
File 1
# File 1
# This line should be ignored
ParaA, ParaB, ParaC, ParaD
ab, 1,2,3
bc, 4,5,6
%%%%%%%%%%%%%%%%%
File 2
# File 2
ParaM, ParaN, ParaO, ParaP, ParaQ
# this line should be ignored
1,2,ab,4,12:40:50
3,4,cd,5,12:40:51
Please see that I am able to read all these file types properly with Textscan if I specify data type to be string. But the overhead of str2double is too high. So, I was wondering if there is some way to read all as %f and have MATLAB store non-numeric columns as NaN.
Thanks
You would likely need to read them in with different format strings for each file. The first one might be:
'%s%f%f%f'
the second:
'%f%f%s%f%s'
reading the time in as a string, since it probably is. I see no way around that otherwise. You could parse the inputs to store only the numeric data and ignore the strings after you read each file.

Sign in to comment.