How to read data from selected lines of a large data set?
Dear all, I want to read selected lines from a large data set (>50 GB). The file is too big to load in full with a single textscan call.
What I have come up with so far is:
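fid = fopen('data.dat');
nline = 0; % index of the line just read
wline = 1000:10^7; % the wanted lines (ascending)
datas = {}; % collected lines, one cell per wanted line
i = 1; % index into wline
while ~feof(fid) && i <= numel(wline) % stop at end of file or after the last wanted line
    ldata = fgets(fid);
    nline = nline+1;
    if nline == wline(i)
        datas{i} = ldata; % a line is a char vector, so store it in a cell
        i = i+1;
    end
end
fclose(fid);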
As you can see, this loop is very time consuming. My questions are:
1. Is there a function that reads the file faster (on a Unix system)?
2. Is it possible to use a file pointer so that only the desired lines are read?
Thank you,
George
The data set has 10^9 lines and 4 columns:
0 0 0 0.5
0 0.05 200.05 1 ...
Answers (1)
That is one big chunk of data. I have several suggestions:
- Preallocate: your code grows datas on every iteration. Preallocate instead, e.g.
datas = zeros(numLines,4);
This might not be a viable option for the whole data set: a 10^9 x 4 matrix of doubles needs about 32 GB of memory.
- Split your data into several chunks that you can read when needed. Look at the Unix split utility (for example, split -l splits a file by line count).
- Use a database.
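The preallocation and chunking ideas above can be combined inside MATLAB. Below is a minimal sketch, assuming the wanted lines form one contiguous range and every line holds four numeric columns. The demo file, firstLine, lastLine and chunkSize are made-up names and values for illustration, not part of the original question. The 'HeaderLines' option of textscan skips the leading lines without parsing them, and the count argument limits each call to one chunk.

```matlab
% Hypothetical demo file standing in for the real data.dat (4 numeric columns).
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fprintf(fid, '%g %g %g %g\n', (1:80)');   % 20 lines: "1 2 3 4", "5 6 7 8", ...
fclose(fid);

firstLine = 5;     % first wanted line (assumed contiguous range)
lastLine  = 16;    % last wanted line
chunkSize = 5;     % lines per textscan call; far larger in practice

numLines = lastLine - firstLine + 1;
datas = zeros(numLines, 4);               % preallocate the result once

fid = fopen(fname, 'r');
row  = 0;                                 % rows filled so far
skip = firstLine - 1;                     % lines to skip on the first call only
while row < numLines
    n = min(chunkSize, numLines - row);   % size of the next chunk
    C = textscan(fid, '%f %f %f %f', n, ...
                 'HeaderLines', skip, 'CollectOutput', true);
    skip = 0;                             % later calls resume where the last one stopped
    block = C{1};
    if isempty(block), break; end         % reached end of file early
    datas(row + (1:size(block, 1)), :) = block;
    row = row + size(block, 1);
end
fclose(fid);
delete(fname);
```

With the demo file, datas ends up holding lines 5 to 16. Skipping lines still scans the file sequentially, so this avoids the per-line loop overhead but not the disk read itself.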
If you want to read just one line, and know the exact position (in bytes from the beginning), you could always try fseek.
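Because the lines in the sample data have different lengths, the byte position of a given line cannot be computed from its line number alone. One possible approach, sketched below under that assumption, is to make a single sequential pass that records where each line starts (using ftell) and then jump to any line with fseek. The demo file and the variable names offsets and wanted are illustrative only. For 10^9 lines the offset table itself is large (about 8 GB as doubles), so in practice one might store only every N-th offset or save the table to disk.

```matlab
% Hypothetical demo file standing in for the real data.dat.
fname = [tempname '.dat'];
fid = fopen(fname, 'w');
fprintf(fid, '%g %g %g %g\n', (1:80)');   % 20 lines: "1 2 3 4", "5 6 7 8", ...
fclose(fid);

% One sequential pass: record the byte offset at which each line starts.
fid = fopen(fname, 'r');
offsets = zeros(20, 1);                   % line count assumed known for the demo
k = 0;
while true
    pos = ftell(fid);                     % position before reading the line
    if ~ischar(fgetl(fid)), break; end    % fgetl returns -1 at end of file
    k = k + 1;
    offsets(k) = pos;
end
offsets = offsets(1:k);

% Later: jump straight to a wanted line without reading the ones before it.
wanted = 7;
fseek(fid, offsets(wanted), 'bof');       % 'bof' = offset from beginning of file
values = sscanf(fgetl(fid), '%f')';       % parse the four numbers on that line
fclose(fid);
delete(fname);
```

The indexing pass costs as much as one full read, so this only pays off when the same file is queried many times.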