This example shows how to execute a cell counting algorithm on a large number of images using Image Processing Toolbox™ with MATLAB® MapReduce and Datastore. MapReduce is a programming technique for analyzing data sets that do not fit in memory. The example also uses MATLAB Parallel Server™ to run parallel MapReduce programs on Hadoop® clusters. The example shows how to create a subset of the images so that you can test your algorithm on a local system before moving it to the Hadoop cluster.
This part of the example shows how to download the BBBC005v1 data set from the Broad Bioimage Benchmark Collection. This data set is an annotated biological image set designed for testing and validation. The image set provides examples of in- and out-of-focus synthetic images, which can be used for validation of focus metrics. The data set contains almost 20,000 files. For more information, see this introduction to the data set.
At the system prompt on a Linux system, use the wget
command to download the zip file containing the BBBC data set. Before
running this command, make sure that your target location has enough
space to hold the zip file (1.8 GB) and the extracted images (2.6
GB).
wget http://www.broadinstitute.org/bbbc/BBBC005/BBBC005_v1_images.zip
At the system prompt on a Linux system, extract the files from the zip file.
unzip BBBC005_v1_images.zip
Examine the image file names in this data set. They are constructed in a
specific format to contain useful information about each image. For example,
the file name
BBBC005_v1_images/SIMCEPImages_A05_C18_F1_s16_w1.TIF
indicates that the image contains 18 cells (C18
) and was
filtered with a Gaussian low-pass filter with diameter 1 and a sigma of
0.25x diameter to simulate focus blur (F1
). The
w1
identifies the stain used.
This example shows how to view the files in the BBBC data set and test an algorithm on a small subset of the files using the Image Batch Processor app. The example tests a simple algorithm that segments the cells in the images. (The example uses a modified version of this cell segmentation algorithm to create the cell counting algorithm used in the MapReduce implementation.)
Open the Image Batch Processor app. From the MATLAB Toolstrip, on the Apps tab, in the Image Processing and
Computer Vision group, click Image Batch Processor. You
can also open the app from the command license using the
imageBatchProcessor
command.
Load the cell image data set into the app. In the Image Batch Processor app, click Load Images and navigate to the folder in which you stored the downloaded data set
Specify the name of the function that implements your cell segmentation algorithm. To specify an existing function, type its name in the Function name field or click the folder icon to browse and select the function. To create a new batch processing function, click New and enter your code in the template file opened in the MATLAB editor. For this example, create a new function containing the following image segmentation code.
function imout = cellSegmenter(im) % A simple cell segmenter % Otsu thresholding t = graythresh(im); bw = im2bw(im,t); % Show thresholding result in app imout = imfuse(im,bw); % Find area of blobs stats = regionprops('table',bw,{'Area'}); % Average cell diameter is about 33 pixels (based on random inspection) cellArea = pi*(33/2)^2; % Estimate cell count based on area of blobs cellsPerBlob = stats.Area/cellArea; cellCount = sum(round(cellsPerBlob)); disp(cellCount);
Select a few images displayed in the app, using the mouse, and click Run to execute a test run of your algorithm.
For this example, choose only images with the
“w1
” stain. The segmentation
algorithm works best with these images.
Examine the results of running your algorithm to verify that your
segmentation algorithm found the correct number of cells in the image. The
names of the images contain the cell count in the C
number. For example, the image named
SIMCEPImages_A05_C18_F1_s05_w1.TIF
contains 18 cells.
Compare this number to the results returned at the command line for both
images.
18 18
This example shows how to set up a small test version on your local system of the large scale processing you want to perform. You should test your processing framework before running it on thousands of files. To do this, you must first create an image datastore to contain the images. MapReduce uses a datastore to process data in small chunks that individually fit into memory. For local testing, select a subset of the images in the datastore to facilitate a quicker, iterative development process. Once you have created the datastore, convert the sample subset of images into Hadoop sequence files, a format used by the Hadoop cluster.
Create an ImageDatastore
. Because the
cell segmentation algorithm implemented in
cellSegmenter.m
works best with the cell body stain,
select only the files with the indicator w1
in their file
names.
localDataFolder = '/your_data/broad_data/BBBC005_v1_images/'; w1FilesOnly = fullfile(localDataFolder,'*w1*'); localimds = imageDatastore(w1FilesOnly); disp(localimds);
ImageDatastore with properties: Files: { ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s01_w1.TIF'; ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s02_w1.TIF'; ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s03_w1.TIF' ... and 9597 more } ReadFcn: @readDatastoreImage
Note that there still are over 9000 files in the subset.
Subset this sample further, selecting every 100th file from the thousands of files in the data set. This brings the number of files down to a more manageable number.
localimds.Files = localimds.Files(1:100:end); disp(localimds);
ImageDatastore with properties: Files: { ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A01_C1_F1_s01_w1.TIF'; ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A05_C18_F1_s01_w1.TIF'; ' .../broad_data/BBBC005_v1_images/SIMCEPImages_A09_C35_F1_s01_w1.TIF' ... and 93 more } ReadFcn: @readDatastoreImage
Repackage the subset of images into Hadoop sequence files. Note that this step simply changes the data from one storage format to another without changing the data value. For more information about sequence files, see Getting Started with MapReduce (MATLAB).
You can use the MATLAB
mapreduce
function to do this
conversion. You must create a “map” function and a
“reduce” function which you pass to the
mapreduce
function. To convert the image files to
Hadoop sequence files, the map function should be a no-op function.
For this example, the map function simply saves the image data as-is, using
its file name as a key.
function identityMap(data, info, intermKVStore) add(intermKVStore, info.Filename, data); end
Create a reduce function that converts the image files into a key-value datastore backed by sequence files.
function identityReduce(key, intermValueIter, outKVStore) while hasnext(intermValueIter) add(outKVStore, key, getnext(intermValueIter)); end end
Call mapreduce
, passing your map and reduce functions.
The example first calls the mapreducer
function to
specify where the processing takes place. To test your set up and perform
the processing on your local system, specify 0
. (When run
locally, mapreduce
creates a key-value datastore back by
MAT-files.) In the following code, the mapreducer
function specifies that the operation should take place on your local
system.
mapreducer(0); matFolder = 'BBBC005_v1_subset_images_w1_mat'; localmatds = mapreduce(localimds, ... @identityMap, @identityReduce,... 'OutputFolder', matFolder); disp(localmatds);
******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 10% Reduce 0% Map 20% Reduce 0% Map 30% Reduce 0% Map 40% Reduce 0% Map 50% Reduce 0% Map 60% Reduce 0% Map 70% Reduce 0% Map 80% Reduce 0% Map 90% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 10% Map 100% Reduce 20% Map 100% Reduce 30% Map 100% Reduce 40% Map 100% Reduce 50% Map 100% Reduce 60% Map 100% Reduce 70% Map 100% Reduce 80% Map 100% Reduce 90% Map 100% Reduce 100% KeyValueDatastore with properties: Files: { ' .../results_1_12-Jun-2015_10-41-25_187.mat'; ' .../results_2_12-Jun-2015_10-41-25_187.mat' } ReadSize: 1 key-value pairs FileType: 'mat'
This example shows how to test your MapReduce framework on your local system. After creating the subset of image files for testing, and converting them to a key-value datastore, you are ready to test the algorithm. Modify your original cell segmentation algorithm to return the cell count. (The Image Batch Processor app, where this example first tested the algorithm, can only return processed images, not values such as the cell count.)
Modify the cell segmentation function to return a cell count.
function cellCount = cellCounter(im) % A simple cell counter % Otsu thresholding t = graythresh(im); bw = im2bw(im,t); % Show thresholding result in app % imout = imfuse(im,bw); stats = regionprops('table',bw,{'Area'}); % Average cell diameter is about 33 pixels (based on random inspection) cellArea = pi*(33/2)^2; % Estimate cell count based on area of blobs cellsPerBlob = stats.Area/cellArea; cellCount = sum(round(cellsPerBlob));
Create map and reduce functions that perform the desired processing. For
this example, create a map function that calculates the error count for a
specific image. This function gets the actual cell count for an image from
the file name coding (the C
number) and compares it to
the cell count returned by the segmentation algorithm.
function mapImageToMisCountError(data, ~, intermKVStore) % Extract the image im = data.Value{1}; % Call the cell counting algorithm actCount = cellCounter(im); % The original file name is available as the key fileName = data.Key{1}; [~, name] = fileparts(fileName); % Extract expected cell count and focus blur from the file name strs = strsplit(name, '_'); expCount = str2double(strs{3}(2:end)); focusBlur = str2double(strs{4}(2:end)); diffCount = abs(actCount-expCount); % Note: focus blur is the key add(intermKVStore, focusBlur, diffCount); end
Create a reduce function that computes the average error in cell count for each focus value.
function reduceErrorCount(key, intermValueIter, outKVStore) focusBlur = key; % Compute the sum of all differences in cell count for this value of focus % blur count = 0; totalDiff = 0; while hasnext(intermValueIter) diffCount = getnext(intermValueIter); count = count +1; totalDiff = totalDiff+diffCount; end % Average meanDiff = totalDiff/count; add(outKVStore, focusBlur, meanDiff); end
Run the mapreduce
job on your local system.
focusErrords = mapreduce(localmatds, @mapImageToMisCountError, @reduceErrorCount);
% Gather the result
focusErrorTbl = readall(focusErrords);
disp(focusErrorTbl);
averageErrors = cell2mat(focusErrorTbl.Value);
******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 10% Reduce 0% Map 20% Reduce 0% Map 30% Reduce 0% Map 40% Reduce 0% Map 50% Reduce 0% Map 75% Reduce 0% Map 100% Reduce 0% Map 100% Reduce 13% Map 100% Reduce 25% Map 100% Reduce 38% Map 100% Reduce 50% Map 100% Reduce 63% Map 100% Reduce 75% Map 100% Reduce 88% Map 100% Reduce 100% Key Value ___ _________ 1 [ 0.8333] 4 [ 0.6667] 7 [ 0.5000] 10 [ 0.5000] 14 [ 0.5000] 17 [ 0.5000] 20 [ 0.5000] 23 [ 0.3333] 26 [ 0.8333] 29 [ 0.8333] 32 [ 3] 35 [ 7] 39 [12.3333] 42 [17.3333] 45 [18.3333] 48 [23.8333]
Inspect the results on the subset. The simple cell counting algorithm used here relies on the average area of a cell or a group of cells. Increasing focus blur diffuses cell boundaries, and thus the area. The expected result is for the error to go up with increasing focus blur. Plot the results.
bar(focusErrorTbl.Key, averageErrors); ha = gca; ha.XTick = sort(focusErrorTbl.Key); ha.XLim = [min(focusErrorTbl.Key)-2 max(focusErrorTbl.Key)+2]; title('Cell counting result on a test data set'); xlabel('Focus blur'); ylabel('Average error in cell count');
This example shows how to load all the image data into the Hadoop file system and run your MapReduce framework on a Hadoop cluster.
Load the image data into the Hadoop file system using the following shell commands. To run this
command, replace your_data
with the location on your
computer.
hadoop fs -mkdir /user/broad_data/ hadoop fs -copyFromLocal /your_data/broad_data/BBBC005_v1_images /user/broad_data/BBBC005_v1_images
Set up access to the MATLAB Parallel Server cluster. To run this command, replace
'your/hadoop/install'
with the location on your
computer.
setenv('HADOOP_HOME','/your/hadoop/install'); cluster = parallel.cluster.Hadoop; cluster.HadoopProperties('mapred.job.tracker') = 'hadoop01glnxa64:54311'; cluster.HadoopProperties('fs.default.name') = 'hdfs://hadoop01glnxa64:54310'; disp(cluster); % Change mapreduce execution environment to point to the remote cluster mapreducer(cluster);
Convert all the image data into Hadoop sequence files. This is similar to what you did on your local system when you converted a subset of the images for prototyping. You can reuse the map and reduce functions you used previously.
% Use the internal Hadoop cluster ingested with Broad Institute files broadFolder = 'hdfs://hadoop01glnxa64:54310/user/broad_data/BBBC005_v1_images'; % Pick only the 'cell body stain' (w1) files for processing w1Files = fullfile(broadFolder, '*w1*.TIF'); % Create an ImageDatastore representing all these files imageDS = imageDatastore(w1Files); % Specify the output folder. seqFolder = 'hdfs://hadoop01glnxa64:54310/user/datasets/images/broad_data/broad_sequence'; % Convert the images to a key-value datastore. seqds = mapreduce(imageDS, @identityMap, @identityReduce,'OutputFolder', seqFolder);
Run the cell counting algorithm on the entire data set stored in the Hadoop file system using the MapReduce framework. The only change from the local version now is that the input and output locations are on the Hadoop file system.
% Output location for error count. output = 'hdfs://hadoop01glnxa64:54310/user/broad_data/BBBC005_focus_vs_errorCount'; tic; focusErrords = mapreduce(seqds, @mapImageToMisCountError, @reduceErrorCount,... 'OutputFolder',output); toc % Gather result focusErrorTbl = readall(focusErrords); disp(focusErrorTbl); averageErrors = cell2mat(focusErrorTbl.Value); % Plot bar(focusErrorTbl.Key, averageErrors); ha = gca; ha.XTick = sort(focusErrorTbl.Key); ha.XLim = [min(focusErrorTbl.Key)-2 max(focusErrorTbl.Key)+2]; title('Cell counting result on the entire data set'); xlabel('Focus blur'); ylabel('Average error in cell count');
Parallel mapreduce execution on the Hadoop cluster: ******************************** * MAPREDUCE PROGRESS * ******************************** Map 0% Reduce 0% Map 7% Reduce 0% Map 10% Reduce 0% Map 12% Reduce 0% Map 20% Reduce 0% Map 23% Reduce 0% Map 25% Reduce 0% Map 28% Reduce 0% Map 30% Reduce 0% Map 32% Reduce 0% Map 33% Reduce 0% Map 38% Reduce 0% Map 41% Reduce 0% Map 43% Reduce 0% Map 48% Reduce 0% Map 50% Reduce 0% Map 51% Reduce 5% Map 53% Reduce 7% Map 55% Reduce 10% Map 56% Reduce 11% Map 58% Reduce 11% Map 62% Reduce 12% Map 64% Reduce 12% Map 65% Reduce 12% Map 67% Reduce 16% Map 69% Reduce 16% Map 71% Reduce 16% Map 74% Reduce 17% Map 75% Reduce 17% Map 76% Reduce 17% Map 79% Reduce 20% Map 83% Reduce 23% Map 85% Reduce 24% Map 88% Reduce 24% Map 92% Reduce 24% Map 94% Reduce 25% Map 96% Reduce 28% Map 97% Reduce 29% Map 100% Reduce 66% Map 100% Reduce 69% Map 100% Reduce 78% Map 100% Reduce 87% Map 100% Reduce 96% Map 100% Reduce 100% Elapsed time is 227.508109 seconds. Key Value ___ _________ 4 [ 1.1117] 7 [ 1.0983] 10 [ 1.0500] 14 [ 0.9317] 17 [ 0.8650] 20 [ 0.7583] 23 [ 0.6050] 26 [ 1.0600] 29 [ 1.5750] 32 [ 4.0633] 42 [18.9267] 48 [26.2417] 1 [ 1.0083] 35 [ 7.5650] 39 [13.2383] 45 [20.5500]