Consider this command:
As the function suggests, the input image is a huge TIFF image (44GB+), so there is no way I can load it into memory on a machine that has 16GB of memory in total.
The output of function_handle has the same number of rows and columns as the input, but it has only two layers of single-precision values; that is, each output value is a single-precision floating-point number computed from the input image. As I mentioned, the input is a 44GB+ 4-band TIFF image in which each band is uint8 (4 bands x 8 bits = 32 bits per pixel). The output for the entire image would therefore be twice that, i.e. 88GB+ (2 bands x 32 bits per band = 64 bits per pixel). So it is also not possible to get the output as one matrix on a machine with a total of 16GB of memory.
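To spell out that size estimate, here is a minimal sketch of the arithmetic. It only uses the figures stated above and assumes uncompressed storage, ignoring any TIFF overhead; the variable names are just for illustration.

```matlab
% Per-pixel storage, using the band counts and types described above.
inBands  = 4;  inBytesPerSample  = 1;   % 4 bands of uint8
outBands = 2;  outBytesPerSample = 4;   % 2 bands of single

inBytesPerPixel  = inBands  * inBytesPerSample;    % 4 bytes per pixel
outBytesPerPixel = outBands * outBytesPerSample;   % 8 bytes per pixel
ratio = outBytesPerPixel / inBytesPerPixel;        % output/input size ratio

inputGB  = 44;                  % lower bound on the input size
outputGB = ratio * inputGB;     % lower bound on the output size
fprintf('Output is %dx the input: at least %d GB\n', ratio, outputGB);
```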
Since the images are geolocated, I need to store the output as TIFF, and because the image is so large I definitely need to write it as BigTiff; hence, I am using a customized BigTiff_Adapter. "blockSize" is set to the tile size (my input TIFF image is tiled, 256x256 pixels per tile). So, essentially, one tile is loaded, processed via function_handle, and then written to the output BigTiff, tile by tile.
So, here is what I don't get. When I set "UseParallel" to false, everything works just fine, except that it takes a long time. You would think that parallel processing should improve that. However, when I set "UseParallel" to true, all my memory is used and the computation seems to be even slower than with "UseParallel" set to false. Literally, the serial version seems to be much faster.
If you are thinking of the communication cost, don't bother. The computation in function_handle is very simple and only needs the data within one pixel, so there is no communication at all. Let's say the output pixel is o(i,j)=K*reshape(I(i,j,:),[],1), where o(i,j) is the pixel in the i-th row and j-th column of the output, K is a 1x3 matrix, and reshape(I(i,j,:),[],1) is a column vector of size 3x1. As you can see, I don't need any information from neighboring blocks or even neighboring pixels. So there is absolutely no inter-block communication (it seems like heaven for parallel processing). All this said, it is not communication inside function_handle that slows down the parallel version; any communication overhead would have to come from outside it (I don't know what MATLAB does under the hood).
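To illustrate that the operation is purely per pixel, here is a small sketch on one in-memory tile. The values of K and the random 3-band tile are made up for illustration and are not my real data; the conversion to single is needed because MATLAB does not do matrix multiplication on uint8 arrays.

```matlab
% Hypothetical values, for illustration only.
K = single([0.2 0.5 0.3]);                  % 1x3 weights (made up)
I = uint8(randi([0 255], 256, 256, 3));     % one 256x256 tile, 3 bands

% Per-pixel form for a single pixel (row, col), as in the formula above.
row = 10; col = 20;
o_rc = K * single(reshape(I(row, col, :), [], 1));

% Same computation for the whole tile at once: flatten to (pixels x bands),
% multiply by K', then restore the tile shape. No pixel uses its neighbors.
[r, c, b] = size(I);
O = reshape(single(reshape(I, r*c, b)) * K', r, c);

% The vectorized result matches the per-pixel result.
assert(abs(O(row, col) - o_rc) < 1e-3);
```

Inside blockproc, the same expression would be applied to the `data` field of the block struct that is passed to the function handle.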
Any idea why turning UseParallel on increases the memory usage and makes the calculation slower? On some machines I even get java.lang.OutOfMemoryError.
I have to add that when I use an image of, let's say, 8 or 9GB, everything works just fine; neither setting of UseParallel (true or false) causes any problem. But when I trace the code, the memory usage goes up as if the entire image were being loaded. I am controlling the number of workers; it varies between 4 and 12. If each block is assigned to one worker, then at most 12 blocks of 256x256x4 bytes are loaded at any time, i.e. 3MB. My output is twice that, so the output should be 6MB. There are not that many intermediate variables during the computation, but let's say they take another 24MB. So, practically, I shouldn't see more than 40MB of memory usage, but I see at least 5 to 6GB.
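Written out, the estimate above looks like this. It is only a back-of-the-envelope sketch that assumes one tile per worker and nothing else held in memory, which may not match what MATLAB actually does internally.

```matlab
nWorkers  = 12;                        % upper end of the worker count
tileBytes = 256 * 256 * 4;             % one uint8 tile with 4 bands

inMB    = nWorkers * tileBytes / 2^20; % input tiles in flight, in MB
outMB   = 2 * inMB;                    % output is twice the input size
tmpMB   = 24;                          % generous allowance for temporaries
totalMB = inMB + outMB + tmpMB;        % expected working set, in MB

fprintf('Expected working set: about %d MB\n', totalMB);
```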
So, what's going on?