Find data from files that are too large to read in
I have structured data files (each about 30 GB). I need to find all the lines in a file that contain a specific number in one of the fields. I am presently doing this by reading each line in turn and checking the field, but it takes a long time (> 1 hr) to scan through the file. The program Hex Fiend allows me to do this manually in a small fraction of the time. Is there a way to read a file up to the point where some condition is met? If there is, I suspect it would speed up finding and extracting the lines I want.
2 Comments
Kevin Lehmann
on 20 Feb 2024
Answers (2)
Walter Roberson
on 17 Feb 2024
0 votes
Use buffer-fulls of data for increased efficiency.
fread() a block of data of fixed size. Scan backwards through the block looking for the last newline, keeping a count of how far you go. Truncate the block there, and fseek() backwards by the number of bytes you had to scan backwards to reach the newline. Now process the in-memory block of data.
Repeat until you reach the end of the file. Be careful, because the file might not end in a newline.
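The steps above could be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the comma-delimited layout, the choice of the second field, the `target` value, and the tiny `blockSize` are all assumptions made for the demo, and the small temporary file stands in for the real 30 GB file. The precision `'uint8=>char'` is used so that one element always corresponds to one byte, which keeps the fseek() arithmetic exact.

```matlab
% Demo data: a small temporary file standing in for the large file.
nl = char(10);                                   % linefeed
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%d,%d\n', [1:20; mod(1:20, 4)]);   % lines such as "7,3"
fprintf(fid, '99,2');                            % last line has no newline
fclose(fid);

blockSize = 64;   % bytes per read; tiny for the demo, far larger in practice
target = '2';     % hypothetical value wanted in the second field
matches = {};     % matching lines collected here

fid = fopen(fname, 'r');
while true
    block = fread(fid, [1 blockSize], 'uint8=>char');
    if isempty(block)
        break;                                   % end of file
    end
    if numel(block) == blockSize
        % A full block may end mid-line: cut at the last newline and
        % seek back so the partial line is re-read with the next block.
        lastNL = find(block == nl, 1, 'last');
        if isempty(lastNL)
            error('A line is longer than blockSize; increase blockSize.');
        end
        fseek(fid, lastNL - numel(block), 'cof');
        block = block(1:lastNL);
    end
    % A short block is the tail of the file and is processed whole,
    % which also covers a final line without a trailing newline.
    lines = regexp(block, '[^\r\n]+', 'match');  % drops CR and LF
    for k = 1:numel(lines)
        fields = strsplit(lines{k}, ',');
        if numel(fields) >= 2 && strcmp(fields{2}, target)
            matches{end+1} = lines{k};
        end
    end
end
fclose(fid);
delete(fname);
```

The per-line loop is only there for readability; on a real block it would likely be worth replacing with a vectorized search over the whole block.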
10 Comments
Kevin Lehmann
on 17 Feb 2024
Walter Roberson
on 18 Feb 2024
1 gigabyte buffer is probably fine.
Kevin Lehmann
on 20 Feb 2024
Walter Roberson
on 20 Feb 2024
In all modern file systems, ASCII files and binary files are just streams of bytes. ASCII files use either linefeed or carriage-return followed by linefeed to signal the end of a line.
There is no reason you cannot fread() a block of data from an ASCII file. The only consequence is that the end of the block of (fixed-length) data might not happen to end in a newline. So you scan backwards from the end of the block looking for the first newline, truncate the block there, and fseek() backwards by the number of bytes you moved backwards.
The result will be a block of characters that has internal newlines (and possibly carriage-returns as well) marking the end of lines. You can process that block as text by any of several different methods, including textscan().
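fid = fopen('sample.txt');          %open the text file for reading
txt = fread(fid,[1 Inf],'*char');   %read the whole file as one row of characters
fclose(fid);
class(txt)                          %the result is a char array
disp(txt)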
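As a small illustration of handing an in-memory block to textscan(), the sketch below parses a character vector directly instead of a file. The two-column, comma-delimited layout and the value being searched for are assumptions for the demo, and `txt` here is built with sprintf() as a stand-in for a block returned by fread().

```matlab
% Stand-in for a block of text returned by fread(); the layout is assumed.
txt = sprintf('1,10\n2,20\n3,10\n');

% textscan accepts a character vector as well as a file identifier.
cols = textscan(txt, '%f %f', 'Delimiter', ',');

wanted = 10;                    % hypothetical value to look for in field 2
keep = cols{2} == wanted;       % logical mask of matching lines
ids = cols{1}(keep);            % first field of the matching lines
```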
Kevin Lehmann
on 20 Feb 2024
Walter Roberson
on 20 Feb 2024
data = fread(FILEID, [37 25000], '*uchar').'; %37*25000 bytes is about 1 megabyte; [37 25000000] would be about 1 gigabyte
%break it up into groups
first_group = data(:,1:5, 'evaluation', 'restricted');
second_group = data(:,6:7, 'evaluation', 'restricted');
Les Beckham
on 20 Feb 2024
@Walter Roberson, did you, perhaps, mean this?
data = fread(FILEID, [37 25000], '*uchar').'; %37*25000 bytes is about 1 megabyte; [37 25000000] would be about 1 gigabyte
%break it up into groups
first_group = str2num(data(:,1:5), 'evaluation', 'restricted');
second_group = str2num(data(:,6:7), 'evaluation', 'restricted');
Kevin Lehmann
on 21 Feb 2024
Walter Roberson
on 21 Feb 2024
Ah, yes, I did mean that!
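For the original goal of pulling out the records whose field equals a given number, the fixed-width read above might be extended along these lines. This is only a sketch under assumptions: each record is taken to be 37 bytes including the line terminator, the field of interest is taken to be columns 1:5 (both borrowed from the snippet above), and the zero-padded demo records and the `target` value are made up. str2double() is swapped in for str2num() here purely to keep the example simple.

```matlab
% Demo data: three 37-byte records (36 characters plus a linefeed).
recLen = 37;
recs = ['00042' repmat('a', 1, 31) char(10); ...
        '00007' repmat('b', 1, 31) char(10); ...
        '00042' repmat('c', 1, 31) char(10)];
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fwrite(fid, recs.', 'uint8');   % transpose so records are written back to back
fclose(fid);

% Read up to 25000 records at once; each record becomes one row.
fid = fopen(fname, 'r');
data = fread(fid, [recLen 25000], 'uint8=>char').';
fclose(fid);
delete(fname);

target = 42;                                  % hypothetical number to find
field = str2double(cellstr(data(:, 1:5)));    % numeric value of columns 1:5
hits = data(field == target, :);              % records whose field matches
```

With a real file, the fread() call and the comparison would sit inside a loop that runs until fread() returns no more data.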