I am working with a single CSV file over 10 GB in size. I'm trying to use mapreduce to perform some analyses. The program works as expected, but I'd like to speed it up a bit by increasing ReadSize.
Currently, each read passes only about 1500 rows, even though ReadSize defaults to 20000 rows. I have over 500k rows in total, so this creates a large number of passes that significantly slow the process.
After looking into tabularTextDatastore, it seems that an internal buffer chunk, hardcoded to 32 MB, is what limits this.
So my questions are:
1) What is the reasoning behind MATLAB's 32 MB buffer size?
2) Is there a good way to work around it? I'd like to decrease the number of passes while increasing the amount of data read per pass (I have about 400 GB of RAM).
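For context, here is a minimal sketch of the setup in question. The file name and the mapper/reducer handles are placeholders; and as noted above, raising ReadSize does not help past a point, because the internal 32 MB buffer appears to cap the rows actually returned per read:

```matlab
% Create a datastore for the large CSV file (file name is a placeholder).
ds = tabularTextDatastore('bigfile.csv');

% Request more rows per read; the default is 20000, but in practice
% each read returns only ~1500 rows, seemingly due to the 32 MB buffer.
ds.ReadSize = 100000;

% Run mapreduce with hypothetical mapper/reducer functions.
outds = mapreduce(ds, @myMapper, @myReducer);
```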