Textscan doesn't work on big files?

I'm currently using the latest MATLAB version on a Mac with 16 GB of RAM.
I tried to split a really big cube file (100 GB) into smaller cube files of only 210151 lines each using this code:
%% Splitting
% Open the result.cube file
fid = fopen(cube);
if fid == -1
    error('File could not be opened.');
end
m = 1;
while ~feof(fid)
    % Skip the alpha and beta density
    fseek(fid, 16596786, 0);
    % Copy the spin density
    text = textscan(fid, '%s', 210150, 'Delimiter', '\n', 'Whitespace', '');
    % Print the cube snapshot to the subdirectory
    name = string(step_nr(m)) + '.cube';
    full_path = fullfile(name1, name);
    fid_new = fopen(full_path, 'w');
    fprintf(fid_new, '%s\n', text{1}{:});
    fclose(fid_new);
    m = m + 1;
end
fclose(fid);
save("steps", "step_nr")
My problem is: apparently, textscan is not suited for files of this size. I also tried line-by-line copying with fgetl, but that takes ages for a 100 GB file. Is there a more efficient way to split the file?
I've read about fscanf and tried this:
tic;
fid = fopen('result.cube');
fgetl(fid) ; fgetl(fid) ;
f = fscanf(fid, '%d %f %f %f', [4 4]) ;
s = fscanf(fid, '%d %f %f %f %f', [5 192]) ;
n = fscanf(fid, '%f %f %f %f %f %f', [6 209953]) ;
fid_new = fopen("new",'w') ;
fprintf(fid_new, '%d %.6f %.6f %.6f\n', f) ;
fprintf(fid_new, '%d %.6f %.6f %.6f %.6f\n', s) ;
fprintf(fid_new, '%f %f %f %f %f\n', n) ;
fclose(fid) ;
t=toc
But my problem here is: `s` is not aligned in the new file the way it is in the big file, and `n` is written as plain decimals instead of scientific notation such as E-02. I also tried copying it line by line, but that takes forever. Any suggestions on how to improve this? I want it to look like this:
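(Editor's note on the scientific-notation problem: fprintf can emit an exponent form directly with the %E conversion. A sketch, where the field widths are guesses and would need to match the original file; note that %E produces 1.73526E-01 rather than the Fortran-style 0.173526E-01, so a byte-identical copy still requires treating lines as text:)

```matlab
% Sketch: write n in scientific notation instead of plain decimals.
% %13.5E prints e.g. 1.73526E-01; the widths here are illustrative.
fprintf(fid_new, '%13.5E %13.5E %13.5E %13.5E %13.5E %13.5E\n', n);
```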

2 Comments

Is your goal to split the file or is your goal to work with the data in MATLAB? If the latter, some of the Large File and Big Data functionality available in MATLAB may be of use to you.
My goal is actually just splitting a really huge file into smaller ones. Afterwards, I want to deal with them individually.


Answers (1)

Harald on 22 May 2024
Hi Oscar,
please attach a sample data file (1 MB will be plenty) so that we can reproduce any issues.
What problem do you encounter with the textscan approach? One issue I suspect: While textscan usually resumes where the previous textscan command left off, you always use fseek to move to the same point again. It seems you should place the call to fseek outside of the while loop.
For block reading, I would usually resort to datastores. If the data is of tabular format, I would specifically use tabularTextDatastore.
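A minimal sketch of block reading with a datastore (assuming whitespace-delimited numeric columns; the cube header lines would need to be handled separately, e.g. via the NumHeaderLines option, and the option values here are illustrative):

```matlab
% Sketch: read the big file in fixed-size blocks via a datastore.
% Assumes whitespace-delimited numeric data; options are illustrative.
ds = tabularTextDatastore("result.cube", ...
    "Delimiter", " ", "MultipleDelimitersAsOne", true, ...
    "NumHeaderLines", 2);
ds.ReadSize = 210150;              % rows per block
m = 1;
while hasdata(ds)
    T = read(ds);                  % one block as a table
    writetable(T, "block" + m + ".txt", "Delimiter", " ", ...
        "WriteVariableNames", false);
    m = m + 1;
end
```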
Best wishes,
Harald

9 Comments

I just attached the file. So my problem with textscan is the alignment. If I run the code as above and print it to a new file, the second block of the data is not aligned. The third block originally has decimals written like "0.173526E-01" but afterwards is presented as "0.017352". Is it possible to preserve everything from the original file?
fseek is inside the loop because the file structure is as follows: First numeric data about A at step 1, then numeric data about B at step 1, then again data about A but at step 2 and so on. I actually don't need data about A, which is why I periodically skip forward and just retrieve data of B.
Hi,
since you read entire lines as strings, I don't see why you would have alignment or formatting issues with the textscan approach. When I run this code with your file, the (due to the small sample only one) generated file reads the same way as the original file.
%% Splitting
% Open the result.cube file
fid = fopen("topcube.txt");
if fid == -1
    error('File could not be opened.');
end
m = 1;
while ~feof(fid)
    % Skip the alpha and beta density
    fseek(fid, 16596786, 0);
    % Copy the spin density
    text = textscan(fid, '%s', 210150, 'Delimiter', '\n', 'Whitespace', '');
    % Print the cube snapshot to the subdirectory
    name = "top" + m + ".cube";
    fid_new = fopen(name, "w");
    fprintf(fid_new, '%s\n', text{1}{:});
    fclose(fid_new);
    m = m + 1;
end
fclose(fid);
%% Compare files
isequal(fileread("top1.cube"), fileread("topcube.txt"))
If you can confirm that there is no problem with the supplied sample file, we will need a sample file for which there is a problem.
Best wishes,
Harald
The problem is that an 'out of memory' error occurs when I run the code with textscan. I tried it with a file of a few hundred MB and it works, but my final 100 GB file can't be processed.
That's crucial information. If you have previously mentioned it, I must have missed it.
Since you read one block at a time and the size of the block remains constant, I do not see why this would not work with big files.
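Since each block apparently has a constant byte length (the fseek offset suggests as much), another sketch worth considering is a raw byte copy with fread/fwrite, which avoids text parsing entirely and keeps memory use at one block. Here blockBytes is hypothetical and would have to be measured from a sample file:

```matlab
% Sketch: byte-level block copy; no parsing, so memory stays small.
skipBytes  = 16596786;   % alpha/beta density block (from the question)
blockBytes = 16596786;   % spin density block: hypothetical, measure it!
fid = fopen('result.cube');
m = 1;
while ~feof(fid)
    fseek(fid, skipBytes, 0);                % skip unwanted block
    data = fread(fid, blockBytes, '*uint8'); % raw copy of wanted block
    if isempty(data), break; end
    fid_new = fopen("spin" + m + ".cube", 'w');
    fwrite(fid_new, data);
    fclose(fid_new);
    m = m + 1;
end
fclose(fid);
```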
You could include code like this to see how memory usage evolves:
T = struct2table(whos);
sum(T.bytes)
Where should I put it?
I would put it before the end statement of the while-loop. You may also want to assign the output of the second line to a variable, such as
memUsage(m) = sum(T.bytes);
If it turns out that variables are growing, the question is which ones. text should really not grow because you are overwriting it, and the size of a block should remain the same.
Oscar Perez on 24 May 2024
Edited: Oscar Perez on 24 May 2024
MATLAB stops exactly at textscan and displays "out of memory", already in the first loop iteration.
Ok. Can you try with a smaller number of rows (say 20000 or 2000) to see what the memory usage is?
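A sketch of that experiment (the sub-block size is illustrative): read each 210150-line block in several smaller textscan calls and append each piece to the output file, so the cell array never holds more than one sub-block at a time:

```matlab
% Sketch: split one 210150-line block into ten smaller reads.
subLines = 21015;                          % 210150 / 10, illustrative
for k = 1:10
    part = textscan(fid, '%s', subLines, ...
        'Delimiter', '\n', 'Whitespace', '');
    fprintf(fid_new, '%s\n', part{1}{:});  % append sub-block
end
```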


Release: R2024a

Asked: on 22 May 2024

Commented: on 24 May 2024
