How to read files sequentially in a parfor loop?

I am running a parfor loop to analyze experiment data files on a local cluster with Intel Xeon quad-core CPUs (8 dual-CPU computers, 64 physical cores in total).
Each cluster node runs Windows Server 2008 R2 Datacenter and MATLAB R2013a with Parallel Computing Toolbox and MATLAB Distributed Computing Server.
The files were written to the hard drive sequentially as wfm1.bin, wfm2.bin, wfm3.bin, ..., up to wfm42000.bin.
Each file is 7.00 MB, and analyzing one file takes 11.7 seconds on a single core.
I built a parfor loop in which each core in the cluster:
(1) reads one file from the shared directory on storage server #1 (node00);
(2) analyzes the data extracted from that file and saves the result (a 13 kB file) to another shared directory on storage server #2 (node16).
But when I open a matlabpool larger than 32, the network traffic from the storage server jams easily (at most 13 MB/s output across the entire cluster on a 1 Gbps network interface, even though all nodes have SATA3 6.0 Gb/s HDDs and point-to-point file transfer can reach 100 MB/s in Windows Explorer). I believe this contention is caused by many workers simultaneously reading non-consecutive files stored at different physical locations on the same hard drive.
Is there any method to make the parfor session read the files one after another, in order to avoid the network traffic jam?
Other parallel solutions would also be appreciated.
Thank you!
Here is the skeleton of my code:
cluster_size = 32;
Bin_Folder_Name = '\\node00\New_RawData\';
Dat_Folder_Name = '\\node16\Fitted_Data_Storage\';
matlabpool('Cluster', cluster_size)       % open a matlabpool of cluster_size workers
parfor j = 1:42000
    Bin_File_name = strcat(Bin_Folder_Name, sprintf('wfm%d.bin', j));
    Dat_File_name = strcat(Dat_Folder_Name, sprintf('result%d.dat', j));
    API_Mul_5_1_Sub(Bin_File_name, Dat_File_name)
end
matlabpool close
  2 Comments
Kirby Fears on 16 Sep 2015
Reading files sequentially goes against the entire idea of simultaneous parallel computing with parfor. Have you benchmarked the speed of this with a regular for loop?
I'm not sure if it would help, but you could break your parfor into fewer iterations with a non-parallel for loop inside to give you sequential file reading.
Below is an example with only 4 parallel workers, each reading a consecutive subset of your files.
parfor j = 1:4
    for k = ((j - 1)*10500 + 1):(j*10500)   % each worker reads one consecutive block of 10500 files
        Bin_File_name = strcat(Bin_Folder_Name, sprintf('wfm%d.bin', k));
        Dat_File_name = strcat(Dat_Folder_Name, sprintf('result%d.dat', k));
        API_Mul_5_1_Sub(Bin_File_name, Dat_File_name)
    end
end
You could play around with the range of j (try j=1:2, j=1:4, etc.) to see whether a smaller number of parallel jobs helps.
Haoyu Wang on 16 Sep 2015
Thank you Kirby, but I thought the method you proposed would still make the workers in the matlabpool read non-sequential files while they are running.
Inspired by your comment, I built a for-[parfor-end]-end double-layer structure, and it works slightly better: I can now have 40 cores running in parallel without any traffic jam on the network.
Again, thank you for your help!
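A minimal sketch of such a for-[parfor-end]-end structure (the chunk size of 1000 is an arbitrary choice for illustration; variable names follow the original skeleton). Only the files inside one chunk are read in parallel, so the file indices in flight at any moment stay close together on disk:
chunk_size = 1000;                          % files per outer iteration (arbitrary choice)
num_chunks = ceil(42000 / chunk_size);
for c = 1:num_chunks                        % sequential outer loop over consecutive chunks
    first = (c - 1)*chunk_size + 1;
    last  = min(c*chunk_size, 42000);
    parfor j = first:last                   % parallel inner loop within one chunk
        Bin_File_name = strcat(Bin_Folder_Name, sprintf('wfm%d.bin', j));
        Dat_File_name = strcat(Dat_Folder_Name, sprintf('result%d.dat', j));
        API_Mul_5_1_Sub(Bin_File_name, Dat_File_name)
    end
end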


Accepted Answer

Walter Roberson on 16 Sep 2015
Unless the hard drive has been optimized to place sequentially named files near each other, it does not matter for disk utilization whether a group of similarly named files is read per parfor iteration or if only one file is read per parfor iteration.
For some operating systems, there are disk optimizers that will group files by name.
But with it taking 11 seconds to process each file, if you read-process-read-process in a single parfor iteration then that is a lot of "dead time" on disk I/O between processing sequentially named files; you might as well not bother. It could make a difference, though, if you used read-read-read-read-process-process-process-process.
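A minimal sketch of that read-read-read-read-process-process-process-process pattern, staging a consecutive batch into local temporary space before processing (the batch size of 4 and the use of tempname for staging are assumptions for illustration, not part of the original answer):
batch_size = 4;                                  % assumed value; chosen to divide 42000 evenly
parfor b = 1:(42000 / batch_size)
    local_dir = tempname;                        % per-iteration scratch space on the worker
    mkdir(local_dir);
    for k = 1:batch_size                         % read phase: consecutive files, no processing between
        f = sprintf('wfm%d.bin', (b - 1)*batch_size + k);
        copyfile(strcat(Bin_Folder_Name, f), fullfile(local_dir, f));
    end
    for k = 1:batch_size                         % process phase: all reads now hit the local disk
        j = (b - 1)*batch_size + k;
        API_Mul_5_1_Sub(fullfile(local_dir, sprintf('wfm%d.bin', j)), ...
                        strcat(Dat_Folder_Name, sprintf('result%d.dat', j)));
    end
    rmdir(local_dir, 's');                       % clean up the scratch space
end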
To improve disk I/O, consider writing the files grouped together, so that reading one metafile grabs several original files (with each metafile being a single consecutive block on the disk). If nothing else comes to mind, ZIP them together and use one of the methods to unzip to memory: https://www.mathworks.com/matlabcentral/newsreader/view_thread/290857 , https://www.mathworks.com/matlabcentral/newsreader/view_thread/240060 , or http://www.mathworks.com/matlabcentral/newsreader/view_thread/290817 . And if your compute nodes have temporary disk space, you could copy a large .zip file from the storage server, unzip it to the local temp space, and read the individual files from there. Combine this with a randomized initial delay so that the workers are not all trying to hit the storage server at the same time.
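A minimal sketch of that zip-and-stage variant with a randomized delay. It assumes the raw files have already been repackaged on the storage server as batch1.zip ... batch42.zip of 1000 files each (the batch layout, archive names, and delay length are all assumptions for illustration):
parfor b = 1:42
    pause(30 * rand());                          % random delay so workers do not all hit the
                                                 % storage server at once (per-iteration
                                                 % approximation of the initial stagger)
    local_dir = tempname;                        % scratch space on the compute node
    mkdir(local_dir);
    zip_name = sprintf('batch%d.zip', b);
    copyfile(strcat(Bin_Folder_Name, zip_name), fullfile(local_dir, zip_name));
    names = unzip(fullfile(local_dir, zip_name), local_dir);   % extract the wfm*.bin files
    for k = 1:numel(names)
        [~, base] = fileparts(names{k});         % e.g. 'wfm123'
        Dat_File_name = strcat(Dat_Folder_Name, strrep(base, 'wfm', 'result'), '.dat');
        API_Mul_5_1_Sub(names{k}, Dat_File_name);
    end
    rmdir(local_dir, 's');                       % remove the staged copies
end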

More Answers (1)

Edric Ellis on 17 Sep 2015
To get finer-grained control over the ordering of parallel operations, you can use spmd blocks instead of parfor loops. The basic pattern would be to do something like this:
spmd
    for idx = 1:numlabs:(numFiles + numlabs)
        myFileIdx = idx + labindex - 1;
        if myFileIdx <= numFiles
            % process file with index myFileIdx
        else
            % skip - we've passed the end
        end
        % The "labBarrier" call here forces all workers to
        % wait until they all reach this call. This stops
        % workers from racing ahead.
        labBarrier();
    end
end
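In this sketch, numFiles would be set before the spmd block (numFiles = 42000 for the data set above). numlabs is the number of workers in the pool and labindex identifies the current worker, so on each pass through the loop the pool reads a block of numlabs consecutively numbered files, and the barrier keeps any single worker from running ahead to a distant part of the disk.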
