Find similar values in a matrix

I am currently working with large data sets, in the range of 500k-1m rows of data in any given (n x 3) matrix.
I want to know how to sift through the rows of the matrix to find any rows that repeat the values of an earlier row.
For example, given [1 2 3; 2 3 4; 3 4 5; 4 5 6; 1 2 3; 5 6 7]
I want to remove the second [1 2 3] row, leaving [1 2 3; 2 3 4; 3 4 5; 4 5 6; 5 6 7]
Can anyone help me with this?

 Accepted Answer

The unique function can help here:
A = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 1 2 3; 5 6 7];
[Au,ia,ic] = unique(A, 'rows', 'stable');
RowIdxFreq = accumarray(ic, 1);
RowIdxFreq =
2
1
1
1
1
The ‘RowIdxFreq’ variable has the frequencies of the occurrences of the rows. Here, row #1 is repeated.
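As a small sketch of what the other two outputs hold (the names ‘firstOccurrence’, ‘groupOfEachRow’, ‘sameAsAu’ and ‘sameAsA’ are only illustrative, not part of the answer above): ‘ia’ gives the row of ‘A’ where each unique row first appears, and ‘ic’ gives, for every row of ‘A’, the row of ‘Au’ it matches.

```matlab
A = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 1 2 3; 5 6 7];
[Au, ia, ic] = unique(A, 'rows', 'stable');

% ia: index into A of the first occurrence of each unique row.
% ic: for each row of A, the index of the matching row in Au.
firstOccurrence = ia.';   % 1 2 3 4 6
groupOfEachRow  = ic.';   % 1 2 3 4 1 5

% The outputs satisfy these two identities:
sameAsAu = isequal(Au, A(ia, :));   % Au is A restricted to first occurrences
sameAsA  = isequal(A, Au(ic, :));   % A can be rebuilt from Au and ic
```

Row 5 of ‘A’ maps to group 1 in ‘ic’, which is how the second [1 2 3] can be located.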

7 Comments

Star Strider,
I did see that command in the documentation, and it is helpful. What I don't get from that is 1) where the second instance occurs and 2) a way to delete that row from the matrix without leaving zeros in its place.
Would just setting a value,
X = unique(A, 'rows', 'stable')
return the matrix without those rows?
Yes. Try it and see.
To find and delete the rows that are repeated, this works:
A = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 1 2 3; 5 6 7; 3 4 5];
[Au,ia,ic] = unique(A, 'rows', 'stable');
RowIdxFreq = accumarray(ic, 1)                 % Occurrences Of Each Unique Row
Repeats = find(RowIdxFreq > 1);                % Unique Rows That Occur More Than Once
RowsToDelete = [];
for k1 = 1:length(Repeats)
    RepeatedRows{k1} = find(ic == Repeats(k1));               % All Rows Of ‘A’ Matching This Unique Row
    RowsToDelete = [RowsToDelete; RepeatedRows{k1}(2:end)];   % Keep The First, Mark The Rest
end
A(RowsToDelete,:) = []; % ‘A’ With Repeated Rows Deleted
The ‘Repeats’ assignment finds the unique rows (as indices into ‘Au’) that occur more than once in the matrix, and ‘RepeatedRows’ is a cell array that contains, for each of them, the indices of all the matching rows in ‘A’. The ‘RowsToDelete’ vector collects every occurrence after the first, then the ‘A’ assignment after the loop uses it to delete all of them at once.
It is not necessary to keep the ‘RepeatedRows’ data in an array. I did here because I wanted to be certain it was doing what I wanted it to.
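For matrices in the 500k-1m row range, the loop may be slow, because each pass scans all of ‘ic’. A possible loop-free sketch, assuming only the first occurrence of each row is to be kept (the variable ‘isDuplicate’ is introduced here for illustration), marks every row that is not a first occurrence:

```matlab
A = [1 2 3; 2 3 4; 3 4 5; 4 5 6; 1 2 3; 5 6 7; 3 4 5];
[Au, ia] = unique(A, 'rows', 'stable');

% Start by flagging every row, then clear the flag on first occurrences.
isDuplicate = true(size(A, 1), 1);
isDuplicate(ia) = false;

RowsToDelete = find(isDuplicate);   % row numbers of the later repeats
A(isDuplicate, :) = [];             % delete them all at once
```

Here ‘RowsToDelete’ comes out as rows 5 and 7, and the reduced ‘A’ equals ‘Au’, so when the row numbers themselves are not needed, ‘Au’ alone is already the answer.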
The unique function is now not working. I have a cell array in which each row is a string of letters followed by coordinates, e.g. [C 1 1 1], which I created with
X = [atom_names num2cell(atomPosition_flat)];
This gives me an (n x 4) cell array.
I tried to use unique to find where the repeated rows are,
atomPositions = unique(X,'rows','stable');
but I get this error: Input A must be a cell array of strings.
Using num2str on the (n x 3) atomPosition_flat matrix turns it into an (n x 33) char array.
Without having your matrix to experiment with, I can only guess.
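A cell reference (the ‘{}’ brackets) will not help here: ‘X{:}’ expands into a comma-separated list of every cell, and, as the error message indicates, unique does not accept a cell array that mixes strings and numbers. See if calling unique on the numeric coordinates and using its index output on the cell array works:
[~,ia] = unique(atomPosition_flat, 'rows', 'stable');
atomPositions = X(ia,:);
This treats rows as duplicates when their coordinates match, whatever the atom name. If the name has to match as well, you might have to use sprintf to convert the numbers to strings before you do the operations in my code. (I assume ‘atom_names’ are already strings.)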
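One possible way to compare the name and the coordinates together without converting numbers to strings is to give each distinct name an integer label and run unique on a purely numeric key. This is only a sketch: the sample data below is made up to stand in for the real 2520-row inputs, and ‘nameIdx’ and ‘key’ are names introduced here for illustration.

```matlab
% Hypothetical sample data standing in for the real names and coordinates.
atom_names = {'C'; 'H'; 'C'; 'O'; 'H'};
atomPosition_flat = [1 1 1; 2 2 2; 1 1 1; 3 3 3; 2 2 5];
X = [atom_names num2cell(atomPosition_flat)];

% Map each distinct name to an integer label (third output of unique).
[~, ~, nameIdx] = unique(atom_names);

% One numeric key row per atom: [nameLabel x y z].
key = [nameIdx atomPosition_flat];

% ia is the index of the first occurrence of each distinct key row.
[~, ia] = unique(key, 'rows', 'stable');

atomPositions = X(ia, :);   % cell array with the repeated rows removed
```

With this sample, row 3 repeats row 1 and is dropped, while the two ‘H’ rows are kept because their coordinates differ. The coordinates are compared for exact equality, so values that differ only by rounding would count as distinct rows.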
lsutiger1 on 2 Feb 2016
Edited: lsutiger1 on 2 Feb 2016
The matrix is a 2520x3 matrix, and yes, atom_names is a 2520x1 vector of strings. I tried converting it to strings using num2str, which did not work, because then I got a "dimension mismatch" error when num2str converted my matrix into a 2520x33 char. Will try to use sprintf.


Asked: 2 Feb 2016

Edited: 2 Feb 2016
