Way of conserving memory when extracting data from CSV

Hi everybody I have few questions. I have some HUGE CSV files which I need in Matlab for analysis. The CSV it self has 5 columns. The columns of relevance are:
Column 1 is our date starting from early 2007 all the way till till mid 2011 in the form of mm/dd/yyyy.
Column 3 is our respective prices
Column 5 is the number of trades.
The questions I have are these:
1) How can I extract these 3 columns into a Matrix in MATLAB without taking too much memory (bear in mind that some of these CSV files have around 60 million rows)? Is there a way to decrease the memory of each cell Matlab allocates for the matrix? Please help with code.
2) How can I extract all the information into a non-string matrix (for analysis) for a specific year....ie only for 2009. So I would require to store in Matrix all information for 2009 (bearing in mind the memory limitations in 1).
Thanks so much.

13 Comments

If memory is an issue, you might want to parse the file line-by-line (using fgetl() or similar). That way, you don't need to read the entire file into memory at once. Also, by retrieving one line at a time, you can easily implement your year=2009 inclusion criterion.
Please provide a sample. Type a few lines in the Matlab editor.
  • For 1) according to my last question (ignoring column 4)*
fid = fopen('CT.csv', 'rt');
datacell = textscan(fid, '%s%s%f%*s%f', 'Delimiter', ',');
fclose(fid);
dates = datenum(datacell{1}, 'mm/dd/yyyy');
outtrades = datenum(datacell{2}, 'MM:SS:FFF');
prices = datacell{3};
ntrades = datacell{4};
datamatrix = horzcat(dates, prices, ntrades);
Now.... for 1) the modifcation we would need is we can also ignore column 2, but also.....is there a way where we can save memory in matlab (ie assign each matrix cell less bytes?). I know for a larger CSV file than I am using I will have trouble....
I cannot see any reason why I should guess and create a sample cvs-file!
If you want sample of CSV file:
Column 1: 04/29/2008 04/29/2008 04/29/2008 04/29/2008 04/29/2008 04/29/2008
Column 2: 38:52.0 38:53.0 38:56.0 38:56.0 38:56.0 38:57.0
Column 3: 71.35 71.35 71.35 71.35 71.35 71.35
Column 4: CTN08 CTN08 CTN08 CTN08 CTN08 CTN08
Column 5: 2 2 3 1 1 1
Please bear in mind we only need column 1,3,5 in our matrix in Matlab
Does the file look like this?
04/29/2008,38:52.0,71.35,CTN08,2
04/29/2008,38:53.0,71.35,CTN08,2
04/29/2008,38:56.0,71.35,CTN08,3
04/29/2008,38:56.0,71.35,CTN08,1
04/29/2008,38:56.0,71.35,CTN08,1
04/29/2008,38:57.0,71.35,CTN08,1
What are the maximum and minimum values of column 3 and 5, respectively?
minimum values for both 3 and 5 is 0 as 3 represent Price (cant be negative) and 5 represents volume (which also cant be negative). The maximum varies. As I need to program to work on many CSV files, and I cant access many of the CSV files (as the max rows allowed in excel is 1 million), therefore I cant tell you maximum for both 3 and 5... Do you need to define a maximum?
See the answer of ImageAnalyst. It is wasteful to store the numbers in double float (8 byte).
Maybe, volume can be stored in uint8.
>> intmax('uint8')
ans =
255
The maximum determines the smallest data class you can use. See code in my answer.
In this case volume (column 5) the maximum would never exceed 2500 (5000 to be sure)
Column 3 the maximum would never exceed 250-350
And price will never exceed
>> intmax('uint32')
ans =
4294967295
cents ????

Sign in to comment.

 Accepted Answer

Something like this will do it
function mate2u
day_number = zeros( 60*1e6, 1, 'uint16' ); % day_number = 1 for 1/1/2007
price = zeros( 60*1e6, 1, 'uint32' ); % 1/100 of cents
volume = zeros( 60*1e6, 1, 'uint16' ); % volume
pivot_day = datenum( '1/1/2007', 'mm/dd/yyyy' );
chunk_size = 10; % choose 5*1e6
fid = fopen( 'mate2u.txt' );
while not( feof( fid ) )
cac = textscan( fid, '%s%*s%f32%*s%u16', chunk_size, 'Delimiter', ',' );
uint16( datenum( cac{1}, 'mm/dd/yyyy' ) - pivot_day )
uint32( cac{2}*10000 )
cac{3}
end
fclose( fid );
end
where mate2u.txt is
04/29/2008,38:52.0,71.35,CTN08,2
04/29/2008,38:53.0,71.35,CTN08,2
04/29/2008,38:56.0,71.35,CTN08,3
04/29/2008,38:56.0,71.35,CTN08,1
04/29/2008,38:56.0,71.35,CTN08,1
04/29/2008,38:57.0,71.35,CTN08,1
prints to command window
ans =
484
484
484
484
484
484
ans =
713500
713500
713500
713500
713500
713500
ans =
2
2
3
1
1
1
>>

11 Comments

Hi Thank you.
Why are we only getting 6 outputs? Shoudlnt we bet getting millions of prices, volumes and dates?
To save on memory, it reads a "chunk" at a time. "N" in the Matlab help. textscan cannot convert the date from the string format to an integer, thus, two steps are needed.
Doc says:
C = textscan(fid, 'format', N) reads data from the file, using the format
N times, ...
OK sure....so does this program do 1) and 2) of the question..... if so how would I alter the code.....for instance...
Lets say I want columns 1,3,5 for 2010 (whilst conserving MATLAB memory)....would this program be able to do this? If so how (what modifications would I need to make for all of 2010 records to be stored in matlab)? Thanks Per Isakson.
Update: When I run the program.... I get cac = 5x1 cell [70.629997;70.610001;70.599998;70.559998;70.220001] [1;1;1;1;1]
but I have around 5 million inputs in this particular CSV file?
Did you see my comment in my answer? That explains the process, though it doesn't give code since I assume you know how to program.
It doesn't take that much memory.
"whilst saving memory for 2010" - only keep data for 2010 or what?
"Why are we only getting 6 outputs?" I made a sample text file with six rows as displayed in a comment above. However, I allocated memory for 6M rows.
Hi, Please see above, I have fixed my English.
Reading and converting text files is slow. If you want to read this data many times I would recommend that you read it once and store it in some binary format. I would choose HDF5.
In my sample code, there are these three lines
uint16( datenum( cac{1}, 'mm/dd/yyyy' ) - pivot_day )
uint32( cac{2}*10000 )
cac{3}
They should be replaced by assignment to day_number etc. There you could make any tests, e.g. if 2010 keep data, else continue
OK forget 2) of my question, lets focus on 1).....I'm sorry to be a bother....but when I run your code....I am unable to replicate valid results. I indeed create 0 matrixes of Daynumber, volume and price......but that's it .....my cac is 1x3 cell with 5x1 array/cell array. Is it possible I can get all of the data need in MATLAB using your above code?
" into a Matrix in MATLAB " to save on memory I have chosen different classes for the different variables. Thus, all three in one matrix is not possible.
One step at a time
  • make a copy of my six line text file (it's in my answer)
  • try to reproduce the result displayed in my answer
Yes, all data will fit in memory. Did you say how much RAM and which version of Matlab you have? I'm not sure you need the loop. However, the my code is just a start.
Hi Per Isakson.....your example works.....but here let me demonstrate some examples where it doesent work:
Input as shown from CTTEST20.txt:
01/03/2007,15:30:06.000,55.90,CTH07,0
01/03/2007,15:30:30.000,55.75,CTH07,0
01/03/2007,15:30:42.000,55.80,CTH07,0
01/03/2007,15:30:53.000,55.85,CTH07,0
01/03/2007,15:30:57.000,55.75,CTH07,0
01/03/2007,15:31:17.000,55.70,CTH07,0
01/03/2007,15:31:23.000,55.65,CTH07,0
01/03/2007,15:31:36.000,55.55,CTH07,0
01/03/2007,15:31:38.000,55.60,CTH07,0
01/03/2007,15:31:43.000,55.55,CTH07,0
01/03/2007,15:31:44.000,55.60,CTH07,0
01/03/2007,15:31:50.000,55.70,CTH07,0
01/03/2007,15:32:07.000,55.55,CTH07,0
01/03/2007,15:32:07.000,55.90,CTH07,0
01/03/2007,15:40:41.000,55.30,CTH07,0
01/03/2007,15:40:43.000,55.40,CTH07,0
01/03/2007,15:40:52.000,55.30,CTH07,0
01/03/2007,15:40:54.000,55.50,CTH07,0
01/03/2007,15:41:33.000,55.15,CTH07,0
01/03/2007,15:41:34.000,55.20,CTH07,0
Output in cac:
'01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007' '01/03/2007'
55.599998 55.700001 55.549999 55.900002 55.299999 55.400002 55.299999 55.500000 55.150002 55.200001
0 0 0 0 0 0 0 0 0 0
As we can see we are missing 10 entries.....in our larger txt/csv files we get many more missing entries. Additionally look at the output prices...I am not sure why they are varying to the input prices (even if it is marginal)
Firstly, make some experiments with the [{}Code] button.
Secondly:
  • convert the script to a function (I've done it in my answer)
  • step through my code with the debugger and analyze what it does
  • notice that the twenty lines are indeed printed in the command window - two chunks of ten entries each
  • the prices are hurt by the single precision "%f32" - you could change f32 to f64

Sign in to comment.

More Answers (1)

What are the classes of each column? Are they all 8 byte (64 bit) doubles? For example, the number of trades might be able to be a 4 byte integer, and most of the floating point numbers could probably be single instead of double. By retrieving it a line at a time and using sscanf() you can place each value into the smallest type of variable that is appropriate for that number. For example, assuming no stock price is over $655.35 you could read in the number and multiply by 100 so that all stock prices are in cents rather than dollars. That way you can use 16 bit unsigned integer instead of a 32 bit single.
I don't have the toolboxes, but perhaps the Financial Toolbox or the Fixed Point Designer may have efficient ways of handling numbers like prices of stocks.
Like Matt said, perhaps you don't need all 60 million rows in memory at once - hopefully you can process it in chunks.

4 Comments

realmax('double')
realmax('single')
intmax('int32')
intmax('int16')
intmax('uint16')
intmax('uint8')
ans =
1.79769313486232e+308
ans =
3.402823e+38
ans =
2147483647
ans =
32767
ans =
65535
ans =
255
Regarding your comment above, They could both be uint16 then. That's 2 bytes instead of 8, so that saves you a lot - a factor of 4 in memory for those two columns.
Thank you....2 things....1) Price is in the form of 55.1500, does this make a difference?
Additionally 2) Is there a way to to convert to unit16 etc before it gets into MATLAB to avoid a out of memory message?
For example, maybe someone asks about 2010 prices, so you scan the file line by line, throwing away data if it belongs to any other year than 2010. Only if the year is 2010 do you use put it into your array. Other years just go into single variables because you used sscanf but you re-use (overwrite) those variables. So on a line by line basis you will have variables thisPrice, thisDay, thisVolume, thisYear, and only when this year = 2010 do you add thisPrice, thisDay, thisVolume to priceArray, dayArray, volumeArray.

Sign in to comment.

Categories

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!