How to remove duplicate rows from a .mat file or text file without sorting?

Hello all, I have a 64954930*3 matrix that contains duplicate rows. How can I delete them without sorting the original matrix?
This is my code:
load('myfile.mat');
%A = unique(myfile(:,:),'rows');
A = myfile(:,:);
% B = unique(A,'stable');
% save('B.mat','B');
%C = setxor(A,'rows','stable');
Should I write a for loop? I tried different formats but got various errors, such as one about sort input arguments of type struct, and 'Undefined myfile'. Kindly help me. Thanks in advance.

7 Comments

Why is your code commented out? And why are you overwriting variable A in the third line? Can you attach a sample of your .mat file?
Hi, thanks for your reply. I tried different ways to remove the duplicate rows, but no method succeeded in my case. In the third line, I assign the data to the variable 'A' so that I can apply the ''unique'' function in the 4th line.
I am unable to attach my file; it is larger than the size limit of the MATLAB site. I attached a small screenshot of my file. Kindly check it out.
Right, you are working with floating point numbers. Try using uniquetol.
Hi, I tried this one as well. I don't know exactly what's wrong in my code. Can you give an example related to my data? I attached a screenshot of the error.
I tried it too. Maybe you can suggest a for loop with a small example, because it doesn't work for me. Thanks in advance!
Nothing is going to work for you if you get the message:
Undefined function or variable 'A'.
because it means there is no variable called A...

 Accepted Answer

Seems you are really close to finding the solution.
out=unique(A,'rows','stable')
should give you the unsorted unique rows :)
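For illustration, a minimal sketch on a small made-up matrix (the values are an assumption, not the poster's data). It shows that the 'stable' option keeps each row at the position of its first appearance instead of sorting:

```matlab
% Hypothetical sample data: rows 1 and 3 are exact duplicates
A = [3 1 2;
     1 1 1;
     3 1 2;
     1 2 3];

% 'rows' compares whole rows; 'stable' preserves the original order
out = unique(A, 'rows', 'stable');
disp(out)
```

Note that this only removes rows that are exactly equal, which matters for floating point data.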

14 Comments

@jonas, thanks for your reply. I tried this unique function as well, and got the error: Undefined function or variable 'A'. I also tried it this way:
A =Load('myfile.mat')
B = unique(A,'Stable')
save('B.mat',B)
but it was not successful!
Seems A is not loaded correctly, as it is undefined.
From the comments I read that you are working with floats, so you should use uniquetol instead, as suggested by Paolo.
%Sample data set, all rows are unique
A=[1,2;1,2;3,3;3,4;1,2;3,4;5,5;6,4;1,1]+rand(9,2).*0.01;
%Make unique sorted matrix, unique defined by tolerance of 0.01
[v,ind]=uniquetol(A,0.01,'ByRows',1)
%Make unsorted matrix
out=nan(9,2)
out(ind,:)=A(ind,:)
out(isnan(out(:,1)),:)=[]
You need to define the tolerance on the basis of your own data, i.e. what do you define as a duplicate?
@jonas, here you defined your matrix with random numbers, but I have a txt/mat file. I tried to use
B = uniquetol(A); save('B.mat','B');
Is there any way to read my mat file and save the result?
I don't understand?
When you type
load('myfile.mat');
What variables appear in your workspace? How is your data stored? Are you loading a .txt file or a .mat file? These are important details.
It's easier if you just provide some sample data from your file. Just copy the file, delete 95% of the data and upload it here.
I attached a few rows of data. Kindly check it. Only the variable from the mat file appears in my workspace; I loaded only the saved mat file.
Much better, thanks.
%Load data into a struct, then extract the variable stored in the file
A=load('myfile1.mat');
A=A.myfile;
%Set tolerance
tol=0.01;
%Make unique sorted matrix
[v,ind]=uniquetol(A,tol,'ByRows',1);
%Make unsorted matrix
out=nan(size(A,1),size(A,2));
out(ind,:)=A(ind,:);
out(isnan(out(:,1)),:)=[];
Reducing the tolerance may give you more uniques. It all depends on how you define a unique value.
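As a possible shortcut, the "unsorted" step can also be sketched by sorting the index vector returned by uniquetol and indexing A with it. The sample matrix below is made up for illustration (with a fixed random seed), and the check at the end compares the result against the NaN-fill approach on the same data:

```matlab
% Hypothetical sample data with small random noise; seed fixed for repeatability
rng(0);
A = [1,2;1,2;3,3;3,4;1,2;3,4;5,5;6,4;1,1] + rand(9,2).*0.01;
tol = 0.01;

% ind holds the row indices of the representatives chosen by uniquetol
[~, ind] = uniquetol(A, tol, 'ByRows', true);

% Sorting the indices restores the original row order
out = A(sort(ind), :);

% Reference: NaN-fill approach, for comparison
ref = nan(size(A));
ref(ind,:) = A(ind,:);
ref(isnan(ref(:,1)),:) = [];

disp(isequal(out, ref))
```

The indexing version avoids allocating a second full-size matrix, which may matter for a matrix with tens of millions of rows, and it does not depend on the data being free of NaN values in the first column.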
@jonas, thanks, it's working for the uploaded file. But when I apply it to the large data file, it gives the error 'Reference to non-existent field 'some file name (e.g. Myfile1)''. Why did you write A = A.myfile when the loaded mat file is myfile1?
Also, we have three columns here, but you wrote
out = nan(size(A,1),size(A,2))
When I look at the mat file, all rows are different, so should I write
out = nan(size(A,1),size(A,2),size(A,3))
Sorry for the inconvenience, and thanks in advance for your help.
The file is called myfile1.mat but the variable it contains is called myfile. The current code is correct, don't change it.
When the script stops with an error, look at your workspace. What variables do you have? You probably have one called A. Type A in the command window and it will show you its fields. See attachment. Access the data with:
data=A.fieldname
where fieldname is the name highlighted in the attachment.
If you want to use MATLAB efficiently, you need to learn some basics about variables. Learning will be much easier if you know how to ask a question and how to provide sufficient information for that question to be answered efficiently.
https://se.mathworks.com/matlabcentral/answers/6200-tutorial-how-to-ask-a-question-on-answers-and-get-a-fast-answer
Got it. Thank goodness. Really appreciate all of the help you gave me, you rock!
Jonas, the data becomes a struct because of the notation you are using. Simply using
load('myfile.mat');
works fine. This is probably what was confusing Ram.
Paolo, many of us on this forum (and the matlab doc) advise against using plain load as 'popping' variables in the workspace can be the cause of many bugs. It's much better to give a target to load as Jonas had done.
Whether or not you load a mat file directly into the workspace or into a structure, you still need to know the names of the variables in the file and you'll always get an error if you get the name wrong.
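One way to find out which variable names a MAT-file contains, sketched under the assumption of a small demo file written to the temporary folder (the file name demo_myfile1.mat and its contents are hypothetical stand-ins for the real data):

```matlab
% Create a small demo MAT-file; the file name differs from the variable name
demoFile = fullfile(tempdir, 'demo_myfile1.mat');
myfile = [1 2 3; 1 2 3; 4 5 6];
save(demoFile, 'myfile');

% List the variables stored in the file without loading the data
info = whos('-file', demoFile);
disp({info.name})

% Load into a struct and pick the field by name via dynamic field reference
S = load(demoFile);
names = fieldnames(S);
A = S.(names{1});
```

Here names{1} is 'myfile', the variable name, not the file name. Using a wrong field name is what produces the 'Reference to non-existent field' error.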
Yes of course, I know, I was just clarifying since it was not clear from Jonas's post, which is why OP was getting confused.
@jonas, when I use the above code for the 54954930*3 matrix, the output is 9800*3. But when I use the code below
A = load('myfile1.mat'); %myfile consists a large data
A = A.myfile;
out=unique(A,'rows','stable');
out is 166603041*3. Which result is correct?
I want to remove only rows that are identical in all 3 columns.
Let's say A = 1 1 1; 1 1 1; 1 2 1; 1 2 3;
out = 1 1 1; 1 2 1; 1 2 3.
I hope I am not being confusing.
As Paolo previously said, you are working with floating point numbers. That is why you need to use uniquetol and not unique. You need to define what a duplicate is (by setting a tolerance). If you set the tolerance to zero, then you will probably end up with the same result using both functions. However, are 1.0000000 and 1.00000001 duplicates? That's what you define with the tolerance, i.e. the maximum difference between two numbers for them to be considered the same number.
Or simply put: do you still have duplicates after you use unique? Then you need to use uniquetol to remove rows that are almost identical to other rows.
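The difference between the two functions can be sketched on a tiny made-up matrix (values are assumptions for illustration). By default uniquetol scales the tolerance by the magnitude of the data, so this sketch sets the 'DataScale' option to 1 to make the tolerance an absolute difference:

```matlab
% Hypothetical data: row 2 differs from rows 1 and 4 by only 1e-9
A = [1 1 1;
     1 1 1 + 1e-9;
     1 2 1;
     1 1 1];

% Exact comparison: row 2 counts as distinct from row 1
exact = unique(A, 'rows', 'stable');

% Tolerance comparison: rows within 1e-6 (absolute) are treated as duplicates
[~, ind] = uniquetol(A, 1e-6, 'ByRows', true, 'DataScale', 1);
approx = A(sort(ind), :);

fprintf('unique keeps %d rows, uniquetol keeps %d rows\n', ...
    size(exact, 1), size(approx, 1));
```

If only rows that match exactly in all three columns should be removed, unique with 'rows' and 'stable' is sufficient. A very different row count from uniquetol suggests the chosen tolerance is merging rows that are merely close.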

Asked:

Ram
on 1 Jul 2018

Edited:

on 1 Jul 2018
