Load parts of VERY LARGE text file content and create a smaller matrix

I have a very large file; a sample of the format is attached. There is a header whose lines are marked with $$ comment marks.
The rest of the data begins at Start 1, Pos #, followed by two columns of data that run up
to Start 2, Pos #, and so on. Note that the columns after Pos # is NOT fixed.
The length of the two columns after each Start #, Pos # ranges from 100 to around 500,000 rows.
The Scan # ranges from 1 to around 4000.
I want to read in, sequentially, the two columns after Start 1, Pos # up to just before Start 2, Pos #, then move on to the block after Start 2, Pos #, and so on.
I have tried textscan with block size but this is not working well.
It is not possible to load all data directly into Matlab.
Any directions will be greatly appreciated.

6 Comments

"Note that the columns after Pos # is NOT fixed."
What does this mean?
The number of data rows after Start 1, Pos 0.026 is NOT always the same as the number of rows after Start 2, Pos 0.043, and so on through the last Start #, Pos #; the row count is not fixed across the file.
Please look carefully at the image I attached. Many thanks.
That's of no help; this is a two-way street. You want help, clarify the problem PRECISELY.
That's an image, not the actual file; it looks fixed-column to me unless you can say something different. What's the problem with it that it's mentioned specifically?
Do you want/need those values as well?
Hi dpb, thanks for your time. I would like to clarify further.
1. I have very many large files from an instrument. A typical file is around 2 GB.
2. A simplified generic layout of a file is attached to this comment for you. It is a tab-delimited file.
3. I want to load all the rows of the column data after each [Start #; Pos #] header BUT I do not know in advance the number of rows after the Start and Pos.
4. The data I want is between [Start 1, Pos 0.026] and [Start 2, Pos 0.043] which is [4 rows by 2 columns]
5. The next block of data I want is between [Start 2, Pos 0.043] and [Start 3, Pos 0.105] which is [5 rows by 2 columns]
6. The next block of data I want is between [Start 3, Pos 0.105] and [Start 4, Pos ##] which is [8 rows by 2 columns]
etc. etc.
My problem is that I do not know the number of rows in each block (that is 4, 5, 8, ...) in advance, as the blocks are not of fixed size.
Is there a way to extract each block of two-column matrix data? I have tried using the Start locations as markers, but I have not yet been successful with textscan and a block size.
If I can extract the blocks, I can work with each one sequentially in a loop and then move on to the next.
I run out of memory if I try to textscan all the data in and sort the blocks out afterwards.
I hope that this is clearer.
Thanks again. John
OK, that's a big step forward...I've gotta' run and finish up the evening chores now but I'll try to take a look at it later on this evening. My first hunch is one can make a textscan call work OK since you can process by grouping, but I'll have to 'spearmint to test the hypothesis. The basic idea is that once you get to the beginning of the first section you do an unterminated read of the floating point data; textscan will convert until it fails on the next section header. Then you trap the failure and read the next text line to reset the file pointer to a clean record and repeat. "Rinse and repeat" until feof.
As I say, one generally has to test these things on a given file to work out the nitty-gritty, but the above tactic generally works.


 Accepted Answer

OK, stuff's taken care of and I'm in for the evening (we're back to family farm; retired from the consulting gig so this is my fun at keeping hand in a little). Anyway, the basic outline is--
fid=fopen('vlarge.txt');
% skip the 5 header lines, then read the first section header
id=cell2mat(textscan(fid,'Start %d','headerlines',5)); % first section ID
pos=cell2mat(textscan(fid,'Pos %f')); fprintf('\n Section %d\n', id),
% unterminated read: conversion stops when the next 'Start' line fails to parse as %f
dat=cell2mat(textscan(fid,'%f %f')); fprintf('%.3f %d\n',dat.')
% process first section here
while ~feof(fid) % w/ the header out of the way, do rest of file...
id=cell2mat(textscan(fid,'Start %d'));
if isempty(id), break, end % nothing matched (e.g. trailing text); don't loop forever
pos=cell2mat(textscan(fid,'Pos %f'));
dat=cell2mat(textscan(fid,'%f %f'));
fprintf('\n Section %d\n', id),fprintf('%.3f %d\n',dat.')
% process subsequent sections here, of course...
end
fid=fclose(fid);
Section 1
100.037 0
118.979 0
118.983 1
118.987 5
Section 2
100.037 0
100.966 0
100.969 1
100.973 0
121.007 7
Section 3
100.037 8
100.966 0
100.969 1
100.973 0
121.007 0
141.040 0
161.074 20
181.107 0
>>
As you see, you're lucky with the blank line in the file that terminates the translation and that all you need for the indeterminate section lengths is the two fields to return the array in the right shape. Note I also went ahead and cast the cell output from textscan to an ordinary array at the time of the read; I almost always do this unless there's some specific reason for needing a cell array.
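On the cell-to-array point, textscan also has a 'CollectOutput' option that concatenates adjacent same-type columns for you. The following is only a minimal sketch of that alternative to cell2mat; the tiny file it writes is an assumption modelled on the sample layout in the question, not the real instrument data.

```matlab
% Build a tiny two-section file. The layout (marker line, Pos line,
% tab-separated pairs, blank line between sections) is assumed from the sample.
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'Start 1\nPos 0.026\n100.037\t0\n118.979\t0\n\n');
fprintf(fid,'Start 2\nPos 0.043\n100.037\t0\n100.966\t0\n121.007\t7\n');
fclose(fid);

fid = fopen(fname,'r');
id  = cell2mat(textscan(fid,'Start %d'));           % section number
pos = cell2mat(textscan(fid,'Pos %f'));             % position value
c   = textscan(fid,'%f %f','CollectOutput',true);   % one cell holding an N-by-2 double
dat = c{1};                                         % first section only
fclose(fid);
delete(fname);
```

Either form ends up with the same N-by-2 array; 'CollectOutput' just does the concatenation inside textscan.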

7 Comments

Hi dpb, This looks great. I will test and send feedback tomorrow.
Many many thanks for your time.
I tried to put something together but I can see yours is compact and less iterative.
John
Hi dpb,
I have tested the code and it worked very well on my sample data format. On a typical data file, I think the header is not being read correctly on the line
id=cell2mat(textscan(fid,'Start %d','headerlines',19)); % first section ID.
It returns
>>id =
Empty matrix: 0-by-1
Note that I have replaced 5 with 19 (not sure if this is correct). Also, Pos # is actually Pos Location # and FUN 1 is FUNCTION 1
I have attached a typical header layout to guide your modifications.
I am indeed very amazed at the simplicity of your code. I am sure if I get past the headerline properly the rest will work with little or no modifications.
Thanks in advance
John
PS: By the way I am also thinking of doing some farming very soon.
Please note that Start=Scan in the headings Thanks
Hi dpb
Many thanks for your efforts. I have found the error. The Scan %d headerline value should be 35 (I found it by investigating with fgets).
Your support is very much appreciated.
Best regards,
John
PS: I may come back to you for farming advice by email.
"...I have found the error. The Scan %d headerline value should be 35."
Good work, glad to hear you got it working. You can also mix the two: use fgetl first to parse the beginning of the file line-by-line to find the first data section, and then switch to textscan, in cases where the header may not always be a consistent length. They use the same file handle so there's no issue there...
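A minimal sketch of that mix, for what it's worth. The marker text 'Scan', the 'Pos %f' format and the made-up header lines are assumptions for illustration; adjust them to whatever the real file contains.

```matlab
% Sample file with a header of unknown length (contents are made up).
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'$$ instrument header\n$$ more header\n$$ last header line\n');
fprintf(fid,'Scan 1\nPos 0.026\n100.037\t0\n118.979\t0\n\n');
fclose(fid);

marker = 'Scan';                      % assumed text that starts each section
fid = fopen(fname,'r');
nHeader = 0;
while true
    here  = ftell(fid);               % byte offset of the line about to be read
    tline = fgetl(fid);
    if ~ischar(tline)                 % fgetl returns -1 at end of file
        fclose(fid);
        error('No line starting with "%s" was found.', marker);
    end
    if strncmp(tline, marker, numel(marker))
        fseek(fid, here, 'bof');      % rewind to the start of the marker line
        break
    end
    nHeader = nHeader + 1;            % count header lines skipped
end

% Same file handle, now positioned at the first section: switch to block reads.
id  = cell2mat(textscan(fid,'Scan %d'));
pos = cell2mat(textscan(fid,'Pos %f'));
dat = cell2mat(textscan(fid,'%f %f'));
fclose(fid);
delete(fname);
```

The line-by-line part only runs over the header, so the cost is paid once per file; everything after it is read in blocks as before.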
I noticed exactly what you said. The header length does change in some files, so I provided a window to search in, using try and catch with fopen and fclose. It is OK so far, but the loop will be more stable and efficient. Thanks for the suggestion.
Yes, you always want to process in as large of blocks as possible; the line-by-line parsing is gare-on-teed to be slow for large files and is to be relied on only when there's no other way.
It is, however, appropriate for the leading header to simply look for the beginning of the data section when there's a variable number of lines and nothing at the beginning of the file that encodes the count, so the 'headerlines' parameter value can't be computed. It'll be a little slower than being able to use the header count of course, but since it's only done once it won't be a killer. One can refine the search if one knows there's some minimum number of lines, by not doing any testing until that minimum number have been read. All sorts of other fancier things are possible for any specific file of course, up to and including reading a sizable chunk of the file into memory as a character array image, doing the searches in memory, and then repositioning the file for the actual scan/conversion...
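The last idea (search a chunk in memory, then reposition) might look roughly like the sketch below. The 64 KiB chunk size, the 'Scan' marker and the sample header are all assumptions, and the byte-offset arithmetic assumes a single-byte (ASCII) encoding.

```matlab
% Sample file with a header of unknown length (contents are made up).
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'$$ instrument header, Scan settings follow\n$$ more header\n');
fprintf(fid,'Scan 1\nPos 0.026\n100.037\t0\n118.979\t0\n\n');
fclose(fid);

fid = fopen(fname,'r');
% Read the first chunk as raw bytes so that index k maps to byte offset k-1.
chunk = fread(fid, 65536, 'uint8=>char').';
% First line that *starts* with the marker; 'lineanchors' lets ^ match after a newline.
k = regexp(chunk, '^Scan\s', 'once', 'lineanchors');
if isempty(k)
    fclose(fid);
    error('Marker not found in the first chunk; read more of the file.');
end
fseek(fid, k-1, 'bof');               % reposition at the first section header

id  = cell2mat(textscan(fid,'Scan %d'));
pos = cell2mat(textscan(fid,'Pos %f'));
dat = cell2mat(textscan(fid,'%f %f'));
fclose(fid);
delete(fname);
```

Anchoring the pattern at the start of a line keeps a stray "Scan" inside a header comment from being mistaken for the first section.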


More Answers (1)

Use fopen to open the file then parse it line by line saving what you need and ignoring the rest. Remember to close the file with fclose as well.
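A minimal sketch of that line-by-line approach, assuming each section begins with a line starting with 'Start' and that data rows are two numbers per line. The sample file is made up, and here "processing" a block just records its row count.

```matlab
% Tiny sample file; the layout is assumed from the question.
fname = [tempname '.txt'];
fid = fopen(fname,'w');
fprintf(fid,'$$ header\nStart 1\nPos 0.026\n100.037\t0\n118.979\t0\n\n');
fprintf(fid,'Start 2\nPos 0.043\n100.037\t0\n100.966\t0\n121.007\t7\n');
fclose(fid);

fid = fopen(fname,'r');
ids = []; nrows = [];
block = zeros(0,2);
inSection = false;
tline = fgetl(fid);
while ischar(tline)                        % fgetl returns -1 at end of file
    if strncmp(tline,'Start',5)
        if inSection
            nrows(end+1) = size(block,1);  % stand-in for processing the finished block
        end
        ids(end+1) = sscanf(tline,'Start %d');
        block = zeros(0,2);
        inSection = true;
    elseif inSection
        v = sscanf(tline,'%f %f');         % empty for 'Pos ...' and blank lines
        if numel(v) == 2
            block(end+1,:) = v.';          % grows row by row: simple but slow
        end
    end
    tline = fgetl(fid);
end
if inSection
    nrows(end+1) = size(block,1);          % last block has no following marker
end
fclose(fid);
delete(fname);
```

Only one block is held in memory at a time, but with up to around 500,000 rows per block the per-line parsing and row-by-row growth will be noticeably slower than block reads with textscan.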

4 Comments

I have used fopen, textscan and strcmp to locate the indexes of the string 'Scan' in a block for the whole data, and then used the index values to capture the data between Scan 1 and Scan 2, then Scan 2 and Scan 3, etc.
Any specific codes along these lines will be much appreciated. Many thanks.
"... used fopen and textscan and strcmp to locate the the string 'Scan' indexes in a block for the whole data..."
So you were able to read the entire file into memory? Your earlier posting said you weren't able to do so. If you can, that simplifies things a bunch.
Show your actual code and again, "clarify, clarify, clarify!" We only know what you tell us; we can't see your workstation from here, nor know what you have/have not done or what results you got, however clear those are to you.
Use fgetl to read each individual line.
Hi Robert,
Many thanks for your efforts. I have found a solution on the forum. I really want to thank everyone who contributed their time towards my question.
Best regards,
John


Asked:

on 11 Mar 2015

Commented:

dpb
on 14 Mar 2015
