Finding partial overlaps between sets

10 views (last 30 days)
Let us say I have some equal-sized sets of indices. I want to determine how much these sets overlap in the sense that I want to know how many of the sets contain a given index. I don't know what this is called, but I think of it as a kind of "soft intersection". I imagine doing it in one of the following ways, but I would like suggestions as to whether this is reasonable and possible better ways.
  1. The indices in the sets can be in (1:N). I have K sets. I could create K length-N indicator arrays and add these K arrays together. The resulting length-N array would contain occurrence counts of the indices of their positions in the array.
  2. Alternatively, I could create the union of the K sets. Then I could use find() to find occurrences of each of the members of the union set in a concatenation of the K index sets. The sizes of the results of find() would be occurrence counts of the indices in the sets.
  1 Comment
Thomas Arildsen
Thomas Arildsen on 10 Jan 2013
I should add that the indices in each of the K sets are unique in that set, so the sets do not contain any duplicates.

Sign in to comment.

Accepted Answer

Thomas Arildsen
Thomas Arildsen on 10 Jan 2013
OK, that didn't take very long to try... I implemented no. 1 as follows:
function [ idxUnion, idxCounts ] = overlap1( idxSets )
%OVERLAP1 quantifies overlap between sets.
% The function counts in how many of the input sets each index occurs.
% The sets are specified as columns of the input matrix.
K = size(idxSets,2);
N = max(idxSets(:));
indicator = zeros(K,N);
for k = 1:K
indicator(k,idxSets(:,k)) = 1;
end
counts = sum(indicator,1);
idxUnion = find(counts ~= 0);
idxCounts = counts(idxUnion);
end
I implemented no. 2 as follows:
function [ idxUnion, idxCounts ] = overlap2( idxSets )
%OVERLAP2 quantifies overlap between sets.
% The function counts in how many of the input sets each index occurs.
% The sets are specified as columns of the input matrix.
K = size(idxSets,2);
N = max(idxSets(:));
idxUnion = unique(idxSets)';
idxCounts = zeros(size(idxUnion));
for ii = 1:length(idxCounts)
idxCounts(ii) = length(find(idxUnion(ii) == idxSets));
end
end
I test them as follows:
function [ result ] = overlap_test()
%OVERLAY_TEST Tests the overlay1 and overlay2 functions for errors.
% This function runs overlay1 and overlay2 with a random test input and
% performs a couple of sanity checks on the output to validate these
% functions.
K = 100;
M = 100;
N = 1e4;
testSets = zeros(M,K);
for k = 1:K
testSets(:,k) = randperm(N,M)';
end
% Test overlap1 union
tic
[idxUnion1,idxCnts1] = overlap1(testSets);
t1 = toc;
disp(sprintf('overlap1 took %f s\n',t1))
result(1,1) = isempty(setdiff(idxUnion1,unique(testSets)));
% Test overlap2 union
tic
[idxUnion2,idxCnts2] = overlap2(testSets);
t2 = toc;
disp(sprintf('overlap2 took %f s\n',t2))
result(2,1) = isempty(setdiff(idxUnion2,unique(testSets)));
% Test agreement on counts
try
result(3,1) = all(idxCnts1 == idxCnts2);
catch anError
warning(anError.identifier,anError.message);
result(3,1) = 0;
end
end
No. 1 seems clearly faster.

More Answers (0)

Tags

Products

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!