Euclidean distance (ED)calculation in matlab
Show older comments
In my program, I have a matrix obtained after lexicographic sorting. Its dimensions are 347275x64 double. My question is "how we can find euclidean distances (ED) between pairs of rows in this matrix?" Also, in my research paper, they specify that we don't need to find the ED between all the pairs of rows, but only for those pairs of rows which are close to offset threshold. So how to decide this value of offset threshold? Can anyone share his / her knowledge about this?
Please help me in this euclidean distance problem.
Thanks in advance.
3 Comments
Image Analyst
on 12 Sep 2017
What is the thresholding done ON???
If it's done on the ED, then how can you know which pairs are "close" until you've done them all? So you can't do it just on the close ones because you don't know which ones are close yet so you must do it on them all.
If the thresholding is done on "how many rows apart" the pairs are, like only do it for rows within 5 rows of the current row, then you only have to do it on some of the rows, not all, but you still have to decide on that number (5 or however many rows apart). If you're following a published paper, don't they say what the threshold is on, and what the value is or how to determine it?
Walter Roberson
on 12 Sep 2017
"If it's done on the ED, then how can you know which pairs are "close" until you've done them all?"
KDTree type searches and quadtree and octree type searches can be more efficient than exhaustive search for this purpose; https://en.wikipedia.org/wiki/K-d_tree#Complexity . For KDTree the building cost is O(n * log(n)) or sometimes less; each query is O(log n)
TUSHAR MURATKAR
on 14 Sep 2017
Accepted Answer
More Answers (2)
Walter Roberson
on 12 Sep 2017
3 votes
4 Comments
TUSHAR MURATKAR
on 12 Sep 2017
Walter Roberson
on 12 Sep 2017
You said that distances should be calculated only for rows that are within an offset threshold. If that threshold is by distance then rangesearch does this calculation for you.
Did you instead mean that the number of rows apart in the matrix is what should be limited?
TUSHAR MURATKAR
on 12 Sep 2017
Walter Roberson
on 12 Sep 2017
The rangesearch documentation has an example of searching by radius with a KDTree; https://www.mathworks.com/help/stats/searcher.rangesearch.html#bui0h43-1
Data = rand(347275, 64) * 1000;
nData = size(Data, 1);
L = nchoosek(1:nData, 2); % Not "{1:1:nData}", but a vector
DE = zeros(2, size(L, 1)); % Pre-allocate
iDE = 0;
thresh = 10^2; % Squared distance
for k = 1:size(L, 2) % SIZE(L, 2) is safer than LENGTH(L)
% Using the function directly is much faster than an anonymous function.
% Calculate the expensive SQRT on demand only:
D2 = sum((Data(L(k,1), :) - Data(L(k,2), :)) .^ 2);
if D2 < thresh
iDE = iDE + 1;
DE(1, iDE) = sqrt(D2); % Store value and index
DE(2, iDE) = k;
end
end
DE = DE(:, 1:iDE); % Crop unused pre-allocated data
Now DE(1, :) contains the distances and L(DE(2, :)) the corresponding indices.
Note that nchoosek(1:347273, 2) has 6.03e10*2 elements. Matlab's nchoosek stores this in a double matrix, such that you need 964.8 GB RAM for this matrix. Using the faster FEX: VChooseK can reduce this by 50% using UINT32 indices:
L = VChooseK(uint32(1):uint32(nData), 2);
But a half TB is still a lot of free RAM. It will be cheaper to create the indices dynamically, but the pre-allocation of the output remains a problem.
% [EDITED, UTC 15:42 15-Sep-2017, Typos fixed, indices fixed, transposed Data]
Data = transpose(Data); % Faster to access 1st dimension
nData = size(Data, 2);
nOut = 1e6; % A guess
DE = zeros(3, nOut); % Pre-allocate the output
iDE = 0;
thresh = 10^2; % Squared distance
L = nchoosek(nData, 2); % Number of outputs, not an array
k1 = 1;
k2 = 2;
for k = 1:L
D2 = sum((Data(k1, :) - Data(k2, :)) .^ 2);
if D2 < thresh
iDE = iDE + 1;
if iDE > nOut % Expand the array, a re-allocation
nOut = nOut + 1e6;
DE(:, nOut) = 0;
end
DE(:, iDE) = [sqrt(D2), k1, k2];
end
% Get new indices:
k2 = k2 + 1;
if k2 > nData
k1 = k1 + 1;
k2 = k1 + 1;
end
end
DE = DE(:, 1:iDE); % Crop unused pre-allocated data
This is still a huge problem and needs about 2.5 days to run. It can be accelerated by a vectorization, but I do not have more time currently to implement this. Walter's suggestion rangesearch() seems to be more promising.
[EDITED] A vectorized and leaner version:
function Near = GetNearPoints(X, Thresh)
% Input: X: [nPoint x m] matrix, nPoints with m dimensions
% Thresh: Positive value
% Output: Near: [nOut x 3], The indices of the rows and the distance
% Author: Jan Simon, License: CC BY-SA 3.0
D = transpose(X);
n = size(D, 2);
C = cell(1, n); % Collect output
T2 = Thresh ^ 2; % Squared threshold
for k = 1:n
D2 = sum(bsxfun(@minus, D(:, k), D(:, k+1:n)) .^ 2, 1);
m = (D2 < T2); % Compare squared distances
if any(m)
C{k} = [repmat(k, sum(m), 1), find(m).' + k, sqrt(D2(m)).'];
end
end
Near = cat(1, C{:});
This is a little bit faster. A parfor will be useful, if you have the parallel processing toolbox. With 8 cores, this might take some hours only. This brute force method cannot be much faster.
3 Comments
TUSHAR MURATKAR
on 15 Sep 2017
Stephen23
on 15 Sep 2017
@TUSHAR MURATKAR: you can thank Jan Simon by accepting the answer that actually solves your question.
Jan
on 16 Sep 2017
I do not think that this brute force method solves the problem efficiently, but Walter's suggestion is much better.
A divide and conquer approach will be smarter also: Searching the complete data set requires nchoosek(347275, 2) = 60.3e9 comparisons. If the volume is split into 2 halves (and considering the an extra interval with the width of the threshold), reduces the problem to 2*nchoosek(347275, 2) + X = 30.1e9 comparisons (plus the small overhead for the margin). This will reduce the processing speed by nearly 50% - and this can be applied repeatedly in X and Y direction.
If this strategy is applied strictly, you come to a k-D tree approach, see e.g. https://www.mathworks.com/matlabcentral/fileexchange/4586-k-d-tree. Unfortunately I cannot get this code to work...
Categories
Find more on Particle & Nuclear Physics in Help Center and File Exchange
Products
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!