Convert files into a matrix
Hi,
I have 177,000 files, and I have to create a matrix containing all the values in these files.
Each file is split using textscan to get
c{1}, c{2}, ...
which are then converted into a matrix.
These per-file matrices are then combined into one matrix.
The problem is that the files contain some shared values, so I have to identify the matching values and append all the other values (the rest of the row) associated with them.
I tried running with 100 files to estimate the running time, and found that it is very long even for just 100 files.
I think that a function which compares c{1} across all files, c{2} across all files, etc., would save time. I'm facing a problem with this code:
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
k=0;arr(:,:)=0; inc=0;k=0;y=1;
for i = 1: length(fileinfo)
thisfilename = fullfile(targetdir, fileinfo(i).name);
f=fopen(thisfilename,'r'); f1=fscanf(f,'%c'); f1(1:2)=[];
f2=fopen(thisfilename,'w'); fprintf(f2,'%c',f1);
f3=fopen(thisfilename,'r');
c = textscan(f,'%f %f %s','Delimiter',',','headerLines',1);
c1=c{1};c2=c{2}; c3=c{3};z=1;z1=1;z2=1;z3=0;
for k=1+k:length(c1)+inc
no=c1(z); arr1=arr(:,1); p=find(arr1==no);
if isempty(p)
j=1;
arr(y,j)=c1(z); arr(y,j+1)=i; arr(y,j+2)=c2(z);j=j+3;y=y+1;
else
ind(i,z1)=p;
L=arr(p,:);len=0;
for h=1:length(L)
if L(h)~=0
len=len+1;
end
end
len;
arr(p,len+1)=i;
arr(p,len+2)=c2(z);
z1=z1+1;
end
z=z+1;
end
inc=inc+length(c1);
[u,u1] =size(arr);
end
f4=fopen('netfile.txt','w');
for i=1:u
for j=1:u1
fprintf(f4,'%d ',arr(i,j));
end
fprintf(f4,'\n');
end
fclose all;
thanks
huda nawaf
on 8 Nov 2011
Accepted Answer
Jan
on 9 Nov 2011
Some general advice for improving the speed:
- One command per line only - otherwise the JIT acceleration loses its power.
- Avoid dummy commands such as "len;" - they waste time.
- Deleting the first two bytes by rewriting the file takes a lot of time. Better: open the file, read two bytes, and call TEXTSCAN afterwards.
- Close every file as soon as possible with fclose(fid). Do not leave all files open until the final fclose('all'). Open files consume resources.
- Use the vectorization of fprintf. Instead of for j=1:u1, fprintf(f4,'%d ',arr(i,j)); end prefer fprintf(f4, '%d ', arr(i, :)).
- Counting the number of non-zero elements in L does not need a loop. Faster: len = sum(L ~= 0);.
- arr(:, :) = 0 is not useful, because it is equivalent to arr = 0. Also, k is defined twice.
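The vectorized counting and printing from the list above can be tried in isolation; a minimal sketch with a small hypothetical matrix A (not from the original code):

```matlab
% Hypothetical data to demonstrate the vectorized forms above
A = [5 0 3; 0 0 7];

% Count non-zero elements per row without a loop
len = sum(A ~= 0, 2);            % gives [2; 1]

% Write each row with a single vectorized fprintf call
fid = fopen('demo.txt', 'w');
for i = 1:size(A, 1)
    fprintf(fid, '%d ', A(i, :)); % one call prints the whole row
    fprintf(fid, '\n');
end
fclose(fid);
```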
I cannot insert a pre-allocation, because I do not know the maximal possible size of "arr". But this should be faster already:
function wwq
targetdir = 'd:\social net\dataset\netflix\training_set';
targetfiles = '*.txt';
fileinfo = dir(fullfile(targetdir, targetfiles));
arr = 0; % Better pre-allocate
inc = 0;
kk = 0;
y = 1;
for i = 1:length(fileinfo)
thisfilename = fullfile(targetdir, fileinfo(i).name);
f = fopen(thisfilename,'r');
fread(f, 2, 'uint8'); % Skip two bytes
c = textscan(f, '%f %f %s', 'Delimiter', ',', 'headerLines', 1);
fclose(f);
c1 = c{1};
c2 = c{2};
% c3=c{3}; % Not used
z = 1;
% z1 = 1; % Not used
% z2 = 1; % Not used
% z3 = 0; % Not used
kknew = length(c1) + inc;
for k = (1 + kk):kknew % Avoid k as counter *and* in loop index
no = c1(z);
p = find(arr(:, 1) == no);
if isempty(p)
arr(y, 1) = c1(z);
arr(y, 2) = i;
arr(y, 3) = c2(z);
% j = j+3; % Not used
y = y + 1;
else
% ind(i,z1) = p; % Not used
L = arr(p, :);
len = sum(L ~= 0);
arr(p, len + 1) = i;
arr(p, len + 2) = c2(z);
% z1 = z1 + 1; % Not used
end
z = z + 1;
end
kk = kknew;
inc = inc + length(c1);
u = size(arr, 1);
end
f = fopen('netfile.txt','w');
for i = 1:u
fprintf(f, '%d ', arr(i, :));
fprintf(f,'\n');
end
fclose(f);
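If the lookup find(arr(:, 1) == no) dominates the running time, a hash-based index can replace the linear scan over the first column. This is a sketch of that idea, not part of the answer above; idMap and the sample data are assumptions:

```matlab
% Sketch: replace the linear find() over arr(:,1) with a hash map.
% The names below mirror the code above but the data is hypothetical.
idMap = containers.Map('KeyType', 'double', 'ValueType', 'double');
ids   = [17 5 17 9 5];      % hypothetical c1 values from several files
rowOf = zeros(size(ids));
y = 1;
for z = 1:numel(ids)
    no = ids(z);
    if isKey(idMap, no)
        rowOf(z) = idMap(no);  % O(1) lookup instead of scanning arr(:,1)
    else
        idMap(no) = y;         % register a new row for this ID
        rowOf(z) = y;
        y = y + 1;
    end
end
% rowOf is now [1 2 1 3 2]
```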