Automate Ground Truth Polygon Labeling Using Grounded SAM Model
This example shows how to combine Grounding DINO and the Segment Anything Model 2 (SAM 2) to automatically produce high‑quality polygon labels from raw video frames. This pairing is commonly referred to as Grounded SAM. In this workflow, you first use the Grounding DINO model to perform open‑vocabulary object detection and return axis‑aligned bounding boxes for the classes of interest. You then pass those boxes to SAM 2 to generate a precise segmentation mask for each detection. Finally, you convert the segmentation masks into polygon boundaries and use them to label objects in the video frames automatically in the Video Labeler app. You can then use the exported ground truth polygon labels to train and evaluate instance segmentation models.
This example demonstrates how to:
Design and implement the Grounded SAM workflow
Create a custom polygon labeling automation algorithm for the Grounded SAM workflow
Integrate and run the custom automation algorithm in the Video Labeler app
Review the generated polygon labels directly on video frames, and export the validated labels as ground truth
If you have large image or video data sets, you can first preprocess them using vision-language models (VLMs) such as Contrastive Language–Image Pre-training (CLIP), using natural‑language prompts to search for and retain only the frames that are likely to contain objects of interest. Once you identify the relevant frames, you can apply the Grounded SAM automation process described in this example to create an end-to-end pipeline for automated labeling, review, and ground truth export to generate training data. For more information about how you can use VLMs to search for and retain relevant frames from a large data set, see Automatically Search and Label Video Frames Using VLMs.
Load Video Frames
This example uses a video file of aerial footage of an airport, available from the U.S. Geological Survey (USGS).
Download the video. Read and visualize a frame from the video.
if ~exist("NAIPAerialImagery.zip","file")
    disp("Downloading Video (20 MB)...")
    unzip("https://ssd.mathworks.com/supportfiles/vision/data/NAIPAerialImagery.zip")
end
videoFile = "NAIPAerialImagery.avi";
V = VideoReader(videoFile);
I = read(V,50);
imshow(I)

Design and Implement Grounded SAM Polygon Labeling Workflow
The Grounded SAM polygon labeling algorithm combines the strengths of Grounding DINO and the Segment Anything Model 2 (SAM 2). The Grounding DINO model performs accurate, class‑based object detection, but returns only bounding boxes. SAM 2 generates precise segmentation masks, but is class-agnostic. Together, they enable automated generation of class‑specific polygon labels using these steps:
Detect objects using the pretrained Grounding DINO model.
Create segmentation masks using SAM 2.
Extract polygon boundaries from the segmentation masks.
The Grounded SAM algorithm requires a CUDA® enabled NVIDIA® GPU and Parallel Computing Toolbox™ for efficient execution because it relies on deep learning models. On GPUs with limited VRAM, running the Grounded SAM algorithm might exhaust GPU memory. If you encounter out‑of‑memory errors, reduce the image resolution or use a GPU with more available memory.
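As a rough illustration of the memory advice above, this minimal sketch checks for a supported GPU and downscales a frame when little GPU memory is free. The 4 GB threshold, the 0.5 scale factor, and the variable names are hypothetical choices for illustration, not values from this example, and the blank frame is a stand-in for a real video frame.

```matlab
% Hypothetical stand-in for a video frame (1080-by-1920 RGB)
I = zeros(1080,1920,3,"uint8");

% Hypothetical threshold: downscale when less than 4 GB of GPU memory is free
minFreeBytes = 4e9;
scale = 1;

if canUseGPU
    % gpuDevice requires Parallel Computing Toolbox
    gpu = gpuDevice;
    fprintf("GPU: %s, %.1f GB available\n",gpu.Name,gpu.AvailableMemory/1e9);
    if gpu.AvailableMemory < minFreeBytes
        scale = 0.5;   % hypothetical downscale factor
    end
else
    disp("No supported GPU found.")
end

% Resize the frame before detection and segmentation (no change if scale is 1)
I_small = imresize(I,scale);
fprintf("Frame size used for inference: %d-by-%d\n",size(I_small,1),size(I_small,2));
```

If you run the workflow on a downscaled frame, the resulting bounding boxes and polygon vertices are in the coordinates of the smaller frame, so you need to scale them back by approximately 1/scale before using them as labels on the original frame.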
Detect Objects Using Pretrained Grounding DINO Model
Load a pretrained Grounding DINO model using the groundingDinoObjectDetector object. This requires the Computer Vision Toolbox™ Model for Grounding DINO Object Detection add-on.
gdino = groundingDinoObjectDetector('swin-base');
Specify the names of the object classes to detect and their brief descriptions. This example uses two classes: commercial jet and private jet.
classNames = ["commercialJet","privateJet"]; classDescriptions = ["commercial jet","private jet"];
Detect objects using the pretrained Grounding DINO model, and visualize the bounding boxes on a video frame.
[bboxes,~,labels] = detect(gdino,I, ...
    ClassNames=classNames,ClassDescriptions=classDescriptions, ...
    Threshold=0.25,MaxSize=[150 150]);
I_labeled = insertObjectAnnotation(I,"rectangle",bboxes,labels,LineWidth=4);
imshow(I_labeled)

Create Segmentation Masks Using SAM 2
Load the pretrained SAM 2 model using the segmentAnythingModel object. This requires the Image Processing Toolbox™ Model for Segment Anything Model 2 add-on.
sam = segmentAnythingModel("sam2-tiny");
Extract SAM 2 encoder feature embeddings for the video frame used in object detection. If you have a CUDA-enabled NVIDIA GPU, input the image as a GPU array for faster execution.
if canUseGPU
    I_gpu = gpuArray(I);
    embeddings = extractEmbeddings(sam,I_gpu);
else
    embeddings = extractEmbeddings(sam,I);
end
Create segmentation masks for the detected objects using the bounding box coordinates from the Grounding DINO results. Visualize the image overlaid with the segmentation results.
maskstack = false([size(I,[1,2]),size(bboxes,1)]);
for i = 1:size(bboxes,1)
    maskstack(:,:,i) = segmentObjectsFromEmbeddings(sam,embeddings,size(I),BoundingBox=bboxes(i,:));
end
I_segmented = insertObjectMask(I,maskstack);
imshow(I_segmented)

Extract Polygon Boundaries from Segmentation Masks
To extract polygon boundaries from each segmentation mask, follow these steps:
Extract polygon boundaries using bwboundaries: Use the bwboundaries function with the noholes option to discard interior holes and return only the outer contour of each object.
Select the polygon boundary of the largest connected region: When bwboundaries returns multiple disjoint regions, keep only the boundary corresponding to the largest connected component.
Simplify polygon vertices using reducepoly: Use the reducepoly function to reduce the number of vertices while preserving the overall shape. Tune the tolerance parameter to balance geometric accuracy and run-time performance.
polygonBoundaries = cell(size(bboxes,1),1);
for i = 1:size(bboxes,1)
    % Compute boundaries of segmentation mask
    maskBoundaries = bwboundaries(maskstack(:,:,i),"noholes", ...
        CoordinateOrder="xy",TraceStyle="pixeledge");

    % Extract boundary corresponding to the largest connected region
    [~,largestMaskIdx] = max(cellfun(@numel,maskBoundaries));
    largestMask = maskBoundaries{largestMaskIdx};

    % Simplify to retain fewer polygon vertices for improved performance
    polygonBoundaries(i) = {reducepoly(largestMask,0.01)};
end
Visualize the polygon boundaries overlaid on the original video frame.
polygonOverlayedImage = insertShape(I,"polygon",polygonBoundaries,LineWidth=2);
imshow(polygonOverlayedImage)
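The boundary selection and simplification steps can be tried without the pretrained models. The example code keeps the boundary with the most coordinates, which approximates the largest region by boundary length. This minimal sketch runs on a synthetic mask and also shows an area-based selection using polyarea, which is an alternative assumed here for illustration and not the method used in this example. The mask and all variable names are hypothetical.

```matlab
% Synthetic mask (hypothetical stand-in for a SAM 2 mask): a disk plus a
% long, thin sliver that has a longer boundary but a smaller area
[X,Y] = meshgrid(1:200,1:200);
mask = hypot(X-60,Y-60) <= 25;      % disk centered at (60,60), radius 25
mask(150:151,20:169) = true;        % 2-by-150 pixel sliver

% Trace outer boundaries with the same options as the example
B = bwboundaries(mask,"noholes",CoordinateOrder="xy",TraceStyle="pixeledge");

% Selection used in the example: boundary with the most coordinates
[~,idxByCount] = max(cellfun(@numel,B));

% Alternative (an assumption, not the example's method): largest enclosed area
areas = cellfun(@(b) polyarea(b(:,1),b(:,2)),B);
[~,idxByArea] = max(areas);
largest = B{idxByArea};

% Compare vertex counts for two reducepoly tolerances
fine = reducepoly(largest,0.001);
coarse = reducepoly(largest,0.01);
fprintf("Selected by count: region %d, selected by area: region %d\n", ...
    idxByCount,idxByArea);
fprintf("Vertices: original %d, tolerance 0.001: %d, tolerance 0.01: %d\n", ...
    size(largest,1),size(fine,1),size(coarse,1));
```

For compact objects such as airplanes seen from above, the two selection rules usually agree. They can differ when a mask contains a thin stray region, in which case the count-based rule may pick the sliver while the area-based rule picks the larger object.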
Create Grounded SAM Polygon Labeling Automation Algorithm
This example provides the GroundedSAMPolygonLabeling class as a ready-to-use automation algorithm for polygon labeling using the Grounded SAM workflow. The GroundedSAMPolygonLabeling class inherits from the vision.labeler.AutomationAlgorithm abstract base class, which defines the class-based API that the Video Labeler app uses to configure and run custom automation algorithms.
To help you get started with writing your own custom automation algorithm, the Video Labeler app offers an initial automation class template where you can add custom logic and integrate it into the app. For more details on accessing the template from the Video Labeler app, see Create Custom Automation Algorithm for Labeling.
The first set of properties in the GroundedSAMPolygonLabeling class specifies the name of the algorithm, provides a brief description of it, and gives directions for using it.
properties(Constant)
    % Name: Give a name for your algorithm.
    Name = 'GroundedSAMPolygonLabeling';

    % Description: Provide a one-line description for your algorithm.
    Description = 'This algorithm uses Grounding DINO and SAM (Grounded-SAM) to annotate airplanes.';

    % UserDirections: Provide a set of directions that are displayed
    %                 when this algorithm is invoked. The directions
    %                 are to be provided as a cell array of character
    %                 vectors, with each element of the cell array
    %                 representing a step in the list of directions.
    UserDirections = {...
        ['Automation algorithms are a way to automate manual labeling ' ...
        'tasks. This AutomationAlgorithm is a template for creating ' ...
        'user-defined automation algorithms. Below are typical steps ' ...
        'involved in running an automation algorithm.'], ...
        ['Run: Press RUN to run the automation algorithm. '], ...
        ['Review and Modify: Review automated labels over the interval ', ...
        'using playback controls. Modify/delete/add ROIs that were not ' ...
        'satisfactorily automated at this stage. If the results are ' ...
        'satisfactory, click Accept to accept the automated labels.'], ...
        ['Change Settings and Rerun: If automated results are not ' ...
        'satisfactory, you can try to re-run the algorithm with ' ...
        'different settings. In order to do so, click Undo Run to undo ' ...
        'current automation run, click Settings and make changes to ' ...
        'Settings, and press Run again.'], ...
        ['Accept/Cancel: If results of automation are satisfactory, ' ...
        'click Accept to accept all automated labels and return to ' ...
        'manual labeling. If results of automation are not ' ...
        'satisfactory, click Cancel to return to manual labeling ' ...
        'without saving automated labels.']};
end
The next section of the GroundedSAMPolygonLabeling class specifies the custom properties required by the core algorithm. The GroundingDINONetwork and the SAMNetwork properties hold the pretrained Grounding DINO and SAM 2 models, respectively. The Labels and LabelDescriptions properties define the object classes used by Grounding DINO for detection. Labels specifies the class names, while LabelDescriptions provides the corresponding natural‑language descriptions that Grounding DINO uses as text queries for each class.
properties
    GroundingDINONetwork
    SAMNetwork
    Labels
    LabelDescriptions
end
checkLabelDefinition, the first method defined in GroundedSAMPolygonLabeling, checks that only labels of type PolygonLabel are enabled for automation. PolygonLabel is the only label type needed because the GroundedSAMPolygonLabeling algorithm automates polygon labeling using the Grounded SAM workflow.
function isValid = checkLabelDefinition(~,labelDef)
    isValid = (labelDef.Type == labelType.Polygon);
end
The next set of methods controls the execution of the algorithm. The vision.labeler.AutomationAlgorithm class provides an interface with methods such as initialize, run, and terminate for setting up and running the automation. The initialize method runs once per automation session and populates the initial algorithm state based on the existing labels in the app. In the GroundedSAMPolygonLabeling class, the initialize method is customized to load the pretrained Grounding DINO and SAM 2 models and to define the object class labels and label descriptions for Grounding DINO.
function initialize(algObj, ~)
    algObj.GroundingDINONetwork = groundingDinoObjectDetector('swin-base');
    algObj.SAMNetwork = segmentAnythingModel("sam2-tiny");
    algObj.Labels = ["commercialJet","privateJet"];
    algObj.LabelDescriptions = ["commercial jet","private jet"];
end
Next, the run method defines the core Grounded SAM algorithm of this automation class. run is called for each video frame, and expects the automation class to return a set of labels. The run method in GroundedSAMPolygonLabeling.m contains the logic previously introduced to create a cell array of polygon labels corresponding to the commercialJet and privateJet object classes. The run method detects objects using the pretrained Grounding DINO model, uses the bounding boxes to create segmentation masks using SAM 2, and then extracts polygon boundaries from the segmentation masks. For optimal performance, execute the algorithm on a CUDA‑enabled NVIDIA GPU.
function autoLabels = run(algObj,I)
    % Create output structure array for generated labels
    autoLabels = struct("Type",{},"Name",{},"Position",{});
    numLabels = numel(algObj.Labels);
    for i = 1:numLabels
        autoLabels(i).Type = labelType("Polygon");
        autoLabels(i).Name = algObj.Labels(i);
    end

    % Detect objects using Grounding DINO model in the current frame
    [bboxes,~,labels] = detect(algObj.GroundingDINONetwork,I, ...
        ClassNames=algObj.Labels,ClassDescriptions=algObj.LabelDescriptions, ...
        Threshold=0.25,MaxSize=[150 150],MinSize=[10 10], ...
        ExecutionEnvironment="auto");

    % Return if there are no objects detected
    if isempty(bboxes)
        return
    end

    % Extract SAM encoder feature embeddings for the current frame
    if canUseGPU
        embeddings = extractEmbeddings(algObj.SAMNetwork,gpuArray(I));
    else
        embeddings = extractEmbeddings(algObj.SAMNetwork,I);
    end

    % Get polygon boundaries for each segmentation mask
    for i = 1:size(bboxes,1)
        % Get segmentation mask for each detected bounding box
        mask = segmentObjectsFromEmbeddings(algObj.SAMNetwork,embeddings, ...
            size(I),BoundingBox=bboxes(i,:));

        % Compute boundaries of segmentation mask
        maskBoundaries = bwboundaries(mask,"noholes", ...
            CoordinateOrder="xy",TraceStyle="pixeledge");

        % Extract boundary corresponding to the largest connected region
        [~,largestMaskIdx] = max(cellfun(@numel,maskBoundaries));
        largestMask = maskBoundaries{largestMaskIdx};

        % Simplify to retain fewer polygon vertices for improved performance
        largestMaskBoundary = {reducepoly(largestMask,0.01)};

        % Determine which label definition this detection belongs to
        labelIdx = find(strcmp(algObj.Labels,string(labels(i))),1);

        % Store polygon boundaries under the correct label
        autoLabels(labelIdx).Position(end+1) = largestMaskBoundary;
    end
end
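The grouping step at the end of the run method, which files each polygon under the label definition that matches its detected class, can be checked in isolation. This minimal sketch uses made-up polygons and class labels in place of real model output, and it omits the Type field so that it runs without any toolbox. The variable names and data are hypothetical.

```matlab
% Hypothetical stand-ins for the detector output (not real detections)
classNames = ["commercialJet","privateJet"];
labels = categorical(["privateJet";"commercialJet";"privateJet"],classNames);
polygons = {[1 1; 5 1; 5 5], ...
            [10 10; 20 10; 20 20; 10 20], ...
            [30 30; 35 30; 35 40]};

% One output entry per label definition, each starting with no polygons
autoLabels = struct("Name",{},"Position",{});
for k = 1:numel(classNames)
    autoLabels(k).Name = classNames(k);
    autoLabels(k).Position = {};
end

% File each polygon under the label definition that matches its class
for i = 1:numel(labels)
    labelIdx = find(classNames == string(labels(i)),1);
    autoLabels(labelIdx).Position{end+1} = polygons{i};
end

% Report how many polygons each label definition received
for k = 1:numel(autoLabels)
    fprintf("%s: %d polygon(s)\n",autoLabels(k).Name,numel(autoLabels(k).Position));
end
```

Each label definition ends up with a cell array that holds one M-by-2 vertex matrix per detected object of that class, which mirrors the structure that the run method builds for the commercialJet and privateJet labels.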
Integrate and Use Grounded SAM Polygon Labeling Automation in Video Labeler App
The GroundedSAMPolygonLabeling.m automation algorithm class file implements the properties and methods described in the previous section. To integrate and use this class in the Video Labeler app, follow these steps:
1. Create the folder structure +vision/+labeler in the current folder, and copy the automation class into it.
mkdir('+vision/+labeler');
copyfile('GroundedSAMPolygonLabeling.m','+vision/+labeler');
2. Open the Video Labeler app with the video to label.
videoLabeler("NAIPAerialImagery.avi")
3. On the Video Labeler tab of the app toolstrip, in the Label Definition section, click Add Label and select Polygon from the drop-down list. Define two polygon labels named commercialJet and privateJet. You can choose a color for each label. Click OK.

4. On the Video Labeler tab, in the Automate Labeling section, click Select Algorithm > Refresh list.

5. From the Select Algorithm options, select GroundedSAMPolygonLabeling. If you do not see this option, ensure that the current working folder has a folder called +vision/+labeler that contains a file named GroundedSAMPolygonLabeling.m.
6. In the Automate Labeling section of the app toolstrip, click Automate. A new panel opens, displaying directions for using the algorithm.
7. On the Automate tab of the app toolstrip, click Run. The algorithm executes on each frame of the video, creating polygon labels for the commercialJet and privateJet categories. After the run is complete, use the slider or arrow keys to scroll through all the frames and verify the results of the automation algorithm.
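If the algorithm does not appear in the Select Algorithm list in step 5, you can check from the command line whether MATLAB can see the class. This minimal sketch assumes the class file is in the +vision/+labeler package folder created in step 1, so that its fully qualified name is vision.labeler.GroundedSAMPolygonLabeling. The variable names are hypothetical.

```matlab
% Fully qualified class name implied by the +vision/+labeler package folder
algName = "vision.labeler.GroundedSAMPolygonLabeling";

% exist returns 8 when the name resolves to a class visible to MATLAB
isFound = exist(algName,"class") == 8;

if isFound
    fprintf("Found %s at:\n  %s\n",algName,which(algName));
else
    fprintf("%s is not visible from %s.\n",algName,pwd);
    disp("Check that +vision/+labeler is in the current folder or in a folder on the path.")
end
```

Package folders are found relative to the current folder or to a folder on the MATLAB path, so the check can fail if you changed folders after creating +vision/+labeler.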

Review Polygon Labels Generated by Grounded SAM Automation Algorithm
The Grounded SAM automation algorithm accurately labels most commercial and private jets, but it can also produce some misclassifications or miss a few airplanes. The Video Labeler app enables you to review the generated labels and make manual corrections where necessary.
To fix misclassified labels, right-click a polygon label in the center pane, then select Reassign ROI and reassign the label to the correct label definition. To label missed airplanes, select the appropriate label from the ROI Label Definition pane and draw the polygon manually. To refine a label, click a polygon and adjust the autogenerated vertices.

Once you are satisfied with the polygon labels across all video frames, click Accept on the app toolstrip to finalize the annotations.
You can also export the generated labels as a groundTruth object for downstream training or evaluation using the Video Labeler app. On the Video Labeler tab of the app toolstrip, select Export and choose whether to export the labeled ground truth to the workspace or to a MAT file.
Next Steps
With the automation for polygon labeling complete and the ground truth data exported, you can now create training data to train and evaluate instance segmentation models. For more information, see Create Instance Segmentation Training Data From Ground Truth. Alternatively, for a more lightweight polygon labeling automation example using a pretrained SOLOv2 network, see Automate Ground Truth Labeling for Instance Segmentation.
This automation workflow is useful for generating first-pass labels, which can significantly accelerate labeling. To learn more about automation for other computer vision tasks such as semantic segmentation, optical character recognition (OCR), and re-identification (ReID) labeling, see Automate Ground Truth Labeling for Semantic Segmentation, Automate Ground Truth Labeling for OCR, and Automate Ground Truth Labeling for Object Tracking and Re-Identification, respectively.
See Also
groundingDinoObjectDetector | segmentAnythingModel | extractEmbeddings | segmentObjectsFromEmbeddings | bwboundaries
Topics
- Create Instance Segmentation Training Data From Ground Truth
- Get Started with AI-Assisted and Automated Labeling
- Create Custom Automation Algorithm for Labeling
- Automatically Search and Label Video Frames Using VLMs
- Automate Ground Truth Labeling for Semantic Segmentation
- Automate Ground Truth Labeling for OCR
- Automate Ground Truth Labeling for Object Tracking and Re-Identification