read in only certain columns of big text file
Show older comments
I want to read in only certain columns of a big tab-separated values text document (.tsv) in the form of a table with 900 columns and 100 lines. Every line in the textfile has the format:
8 columns %s
112 repetitions of: %d%d%d%d%d%d%s
100 columns: %d (irrelevant to me)
How can I import the table as a cell array of dimension (number of columns)*(number of lines) (or vice versa) without having to specify all the 900 format specifier (& skipping the last 100 columns)?
In fact, I only need every 8+6*k column and every 8+6*(k+1) column for k=1:112, i.e. the last 2 elements (%d%s) of the sequence repeated 112 times.
Using textscan(fileid,'%s','Delimiter','\t'); gives me a cell array of size 1*(number of total elements) instead, which is not very practicle to deal with if I want to use certain columns. Also, I didn't know how to solve the format specifier issue and simply read everything as strings.
Using readtable('filename.tsv','Delimiter','\t'); gives me the error message: Undefined function 'readtable' for input arguments of type 'char'.
2 Comments
Image Analyst
on 26 Mar 2014
readtable is only with R2013b and later - you probably have an earlier version.
Accepted Answer
More Answers (2)
Simon
on 24 Oct 2013
Hi!
I'll try an answer for my short example:
1 2 s 4 5 a 5
a 4 c d 6 f i
9 8 d g 10 i n
(all tab separated!)
I want to extract columns 2 and 5 as double and 3 and 6 as string (you need to adapt ind1 and ind2 to your needs)
% read in file with textscan
fid = fopen('tabdata.txt');
FC = textscan(fid,'%s','Delimiter','\n');
fclose(fid);
FC = FC{1};
% split all lines (attention: the last column is not processed!!!)
FCsplit = regexp(FC, '(\S*)[\s]', 'match');
% columns to extract as double
ind1 = 2:3:5;
% columns to extract as string (right neighbor of column of doubles)
ind2 = ind1 + 1;
% prepare result arrays for double and string (as cell array)
D = zeros(length(FCsplit), length(ind1));
S = cell(length(FCsplit), length(ind1));
% loop over all columns to extract
for n = 1:length(ind1)
D(:, n) = cellfun(@(x) str2double(x(ind1(n))), FCsplit);
S(:, n) = cellfun(@(x) strtrim(x(ind2(n))), FCsplit);
end
I don't know about the performance for large files. Try it out!
Ken Atwell
on 24 Oct 2013
You can use repmat avoid typing '%*d%*d%*d%*d%*d%d%s' 112 times (also note the use of '*' to skip importing certain columns. Try something like this:
formatSpec = ['%*s%*s%*s%*s%*s%*s%*s%*s' repmat('%*d%*d%*d%*d%*d%d%s', [1,112]) '%*[^\n]'];
textscan(fileId, formatSpec, 'Delimiter', '\t')
2 Comments
Ken Atwell
on 24 Oct 2013
Test code with a dummy file:
of = fopen('test.tsv', 'w');
for i = 1:100
fprintf(of, 's1\ts2\ts3\ts4\ts5\ts6\ts7\ts8\t');
for j=1:112
fprintf(of, '%d\t%d\t%d\t%d\t%d\t%d\tstr\t', j+0, j+1, j+2, j+3, j+4, j+5);
end
for j=1:100
fprintf(of, '-1\t');
end
fprintf(of, '\n');
end
fclose (of);
fileId = fopen('test.tsv');
formatSpec = ['%*s%*s%*s%*s%*s%*s%*s%*s' repmat('%*d%*d%*d%*d%*d%d%s', [1,112]) '%*[^\n]'];
T = textscan(fileId, formatSpec, 'Delimiter', '\t');
fclose (fileId);
Charlotte
on 29 Oct 2013
Categories
Find more on Text Files in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!