I am trying to read rather large text files and post-process them for relevant data in MATLAB. The files are about 20-50 GB on average, and I unfortunately have no control over their formatting, as each is the output of another program. The output text file contains large amounts of whitespace, text, and other irrelevant data, but it does have a consistent structure that I have been able to decode, and I can extract the relevant numerical information from files small enough to fit in memory. Now I have to make this work at a larger scale.
I cannot share the format of the file as it is restricted, but I can describe it. The file can be split on the form-feed character (char(12)) into pages, and each page has a specific format depending on the information it contains. Essentially, my current approach does the following:
1) Read the text file in: A = fileread(File)
2) Split the file into its pages via P = regexp(A,char(12),'split')
3) Loop through each page found and use further splitting commands to extract needed numerical data and organize it
4) Output a data structure (MATLAB struct) of organized data from the function
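A minimal sketch of the steps above (parsePage is a hypothetical placeholder for the format-specific extraction logic, which I cannot share):

```matlab
% Step 1: read the whole file into memory as one char array
% (this is the step that fails with out-of-memory errors on 20-50 GB files)
A = fileread(File);

% Step 2: split the file into pages on the form-feed character
P = regexp(A, char(12), 'split');

% Steps 3-4: parse each page and collect the results into a struct array;
% parsePage stands in for the restricted per-page parsing code
results = struct([]);
for k = 1:numel(P)
    results = [results, parsePage(P{k})]; %#ok<AGROW>
end
```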
This works well so far, but for larger files the first step fails with out-of-memory errors, so the file never gets read in. From some searching it seems I may be able to get past this with a datastore or a tall array, but I am unsure whether that approach will scale or whether I should try something different. Can someone make a suggestion? Is the current function scalable by simply converting it to use tall arrays?
As a side note regarding the use of datastore: the text file contains NON-TABULAR data, if that is relevant.