Splitting a matrix according to its labels

I have a 1900 x 4 matrix of doubles whose fourth column contains the labels 3, 2 and 1. I want to split this data in a 20:80 ratio into A and B, where A contains 20% of the rows of each label (3, 2 and 1) and B contains the remaining 80% of each label, i.e. 80% of label 3, 80% of label 2 and 80% of label 1. How can this be achieved?

6 Comments

How exactly would you split? By what criteria? Also, what do you mean by labels?
Hi Dyuman, thanks for your response.
I want to split by rows.
Each row holds the features of a signal. The last column holds the label (class) of that row, which is 3, 2 or 1.
I want to divide this data (1900 x 4 double) into two parts, A (xyz x 4 double) and B (uvw x 4 double), for training and testing. A should contain 20% of the data and B 80%, where A contains 20% of the rows of each of the labels 3, 2 and 1, and B contains 80% of the rows of each label.
Look at findgroups and nchoosek for ideas...
By split, I meant: how exactly should the 20% and 80% be allotted? Randomly, or by some criterion?
Randomly.
But,
A (containing 20% of data rows) should contain [20% from label 3 rows + 20% from label 2 rows + 20% from label 1 rows].
B (containing 80% of data rows) should contain [80% from label 3 rows + 80% from label 2 rows + 80% from label 1 rows].
Add splitapply (or rowfun, if using a table) to the above...
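A rough sketch of how the findgroups and splitapply hints could be combined. The stand-in data, the variable names (fracA, rowIdx, pickA, inA) and the fixed seed are assumptions made for illustration, not part of the original posts.

```matlab
% Stand-in data: 3 feature columns plus a label column (assumed layout)
rng(1);                                   % fixed seed, only for repeatability
data = [rand(1900,3), randi(3,1900,1)];
fracA = 0.2;                              % fraction of each label sent to A

G = findgroups(data(:,4));                % group number for every row
rowIdx = (1:size(data,1)).';              % row numbers, to be split by group

% For one group's row numbers r, pick round(fracA*numel(r)) of them at random.
% The result is wrapped in a cell because splitapply needs one scalar per group.
pickA = @(r) {reshape(r(randperm(numel(r), round(fracA*numel(r)))), [], 1)};
idxA = splitapply(pickA, rowIdx, G);      % one cell of row numbers per label

inA = false(size(data,1),1);
inA(vertcat(idxA{:})) = true;             % mark the rows chosen for A
A = data(inA,:);                          % about 20% of each label
B = data(~inA,:);                         % the remaining rows of each label
```

Using a logical mask keeps the rows of A and B in their original order; index with the shuffled row numbers directly if a random order is wanted.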


 Accepted Answer

Jon on 10 May 2022
Edited: Jon on 10 May 2022
This is one way to do it:
% make an example data set whose last column holds a "label" of 1, 2, or 3
data = [rand(1900,3),randi(3,[1900,1])];
% loop through labels making training and validation data sets
% note: as written, A receives the 80% (training) part and B the 20% part
Aparts = cell(3,1);
Bparts = cell(3,1);
for k = 1:3
    % get the indices of the rows with kth label
    idx = find(data(:,4)==k);
    numWithLabel = numel(idx);
    idxrand = idx(randperm(numWithLabel)); % randomize the selection
    % randomly put (within rounding) 80% in training, 20% in validation
    numTrain = round(0.8*numWithLabel);
    Aparts{k} = data(idxrand(1:numTrain),:);
    Bparts{k} = data(idxrand(numTrain+1:end),:); % the rest go to validation
end
% put all of the parts in one matrix of doubles
A = cell2mat(Aparts);
B = cell2mat(Bparts);
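As a different technique from the loop in this answer: if the Statistics and Machine Learning Toolbox is available (an assumption), cvpartition can produce a stratified hold-out split from the label column. A minimal sketch on stand-in data, with A as the 20% part as in the question:

```matlab
% Assumes the Statistics and Machine Learning Toolbox is installed
rng(1);                                       % fixed seed, only for repeatability
data = [rand(1900,3), randi(3,1900,1)];       % stand-in data, labels in column 4

% Passing the labels as the grouping variable makes the hold-out stratified,
% so each label is split in roughly the same 20:80 proportion
c = cvpartition(data(:,4), 'HoldOut', 0.2);

A = data(test(c),:);       % held-out part, about 20% of each label
B = data(training(c),:);   % remaining part, about 80% of each label
```

The per-label counts are approximate because of rounding, just as in the loop version.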

13 Comments

Hi Jon, thanks for your response.
My data has changed to Combined_Data (7886 x 8 double). The last column, i.e. column 8, contains the labels 3, 2, 1.
I tried the code below but it does not give the desired result. Can you please check it?
Desired output:
matrix A(xyz x 8 double)
matrix B(uvw x 8 double)
A (containing 20% of data rows) should contain [20% from label 3 rows + 20% from label 2 rows + 20% from label 1 rows].
B (containing 80% of data rows) should contain [80% from label 3 rows + 80% from label 2 rows + 80% from label 1 rows].
code:
filename = 'C.xlsx';
Combined_Data = xlsread(filename);
% loop through labels making training and validation data sets
Aparts = cell(7,1);
Bparts = cell(7,1);
for k = 1:7
idx = find(Combined_Data(:,8)==k);
numWithLabel = numel(idx);
idxrand = idx(randperm(numWithLabel)); % randomize the selection
% randomly put (within rounding) 80% in training, 20% in validation
numTrain = round(0.8*numWithLabel);
Aparts{k} = Combined_Data(idxrand(1:numTrain),:);
Bparts{k} = Combined_Data(idxrand(numTrain+1:end),:); % the rest go to validation
end
% put all of the parts in one matrix of doubles
A = cell2mat(Aparts);
B = cell2mat(Bparts);
The only thing that needs to change is the column that the labels are in. You are still looping through the three possible labels 1, 2, 3, so don't change the dimensions that relate to the labels you are looping through:
Combined_Data = xlsread(filename);
% loop through labels making training and validation data sets
Aparts = cell(3,1); % first dimension is number of labels
Bparts = cell(3,1);
for k = 1:3 % just loop through the labels
    idx = find(Combined_Data(:,8)==k);
    numWithLabel = numel(idx);
    idxrand = idx(randperm(numWithLabel)); % randomize the selection
    % randomly put (within rounding) 80% in training, 20% in validation
    numTrain = round(0.8*numWithLabel);
    Aparts{k} = Combined_Data(idxrand(1:numTrain),:);
    Bparts{k} = Combined_Data(idxrand(numTrain+1:end),:); % the rest go to validation
end
% put all of the parts in one matrix of doubles
A = cell2mat(Aparts);
B = cell2mat(Bparts);
Ya this is working.
Thanks a lot Jon!!
Glad it works. If you anticipate that the number of labels and/or the column that the labels are in may change in the future, it would be a good idea to make those variables and assign them at the top of the code (or, if you turn the code into a function, pass them as function arguments).
Ya, this can be a good alternative.
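A sketch of what moving those settings to the top could look like. The names labelColNo, fracA and inA, the stand-in data, and the use of unique to discover the labels are assumptions for illustration, not taken from the thread:

```matlab
% Parameters gathered at the top so they are easy to change
labelColNo = 8;      % column that holds the labels
fracA      = 0.2;    % fraction of each label that goes into A

% Stand-in data; replace with the real matrix, e.g. the one read from the file
rng(1);              % fixed seed, only for repeatability
data = [rand(1900,labelColNo-1), randi(3,1900,1)];

labels = unique(data(:,labelColNo));        % whatever labels actually occur
inA = false(size(data,1),1);                % true for rows assigned to A
for k = 1:numel(labels)
    idx = find(data(:,labelColNo) == labels(k));  % rows carrying this label
    idx = idx(randperm(numel(idx)));              % shuffle them
    nA  = round(fracA*numel(idx));                % how many go to A
    inA(idx(1:nA)) = true;
end
A = data(inA,:);     % about fracA of each label
B = data(~inA,:);    % the rest of each label
```

The same body could be wrapped in a function taking data, labelColNo and fracA as arguments.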
In case it is of interest, here is a much simpler way to do the splitting without using any loops. I also made it more general, so that you can use any list of labels you want; they don't even have to be consecutive, and there can be an arbitrary number of labels.
% parameters
numPoints = 1900; % only needed to generate example data
labelColNo = 8; % column number of labels
labels = [1,2,3]; % possible labels
% make an example data file with last column having random label values
% from set of possible labels
numLabels = numel(labels);
labelColumn = labels(randi(numLabels,numPoints,1));
data = [rand(numPoints,labelColNo-1),labelColumn(:)];
% randomize (shuffle) the rows
data = data(randperm(numPoints),:);
% make a (number of data points) by (number of possible label values)
% logical matrix with a column for each label, whose (i,j)th entry is true if
% the jth label occurs in the ith row of the label column
isLabel = data(:,labelColNo)==labels;
% count entries and normalize to get cumulative fractions;
% record the cumulative fraction corresponding to each occurrence of a label
counts = cumsum(isLabel);
f = counts./counts(end,:).*isLabel; % sets values to zero where label doesn't occur
% mark all the rows with a cumulative fraction up to 0.8 as being in the training set
isTraining = any(f>0 & f<=0.8,2);
A = data(isTraining,:);
B = data(~isTraining,:);
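To confirm that a split like this is balanced, the per-label fractions can be tabulated afterwards. A small sketch that repeats the loop-free steps on stand-in data and then checks them; totalPerLabel, fracInA and fracInB are names introduced here for illustration:

```matlab
rng(1);                                   % fixed seed, only for repeatability
labelColNo = 8;
labels = [1,2,3];
data = [rand(1900,labelColNo-1), randi(3,1900,1)];   % stand-in data
data = data(randperm(size(data,1)),:);    % shuffle the rows

% same steps as in the loop-free split
isLabel = data(:,labelColNo)==labels;
counts = cumsum(isLabel);
f = counts./counts(end,:).*isLabel;
isTraining = any(f>0 & f<=0.8,2);
A = data(isTraining,:);
B = data(~isTraining,:);

% fraction of each label that landed in A and in B (one entry per label)
totalPerLabel = sum(isLabel,1);
fracInA = sum(A(:,labelColNo)==labels,1) ./ totalPerLabel;
fracInB = sum(B(:,labelColNo)==labels,1) ./ totalPerLabel;
disp([labels; fracInA; fracInB]);         % rows: label, share in A, share in B
```

With this code A holds roughly 80% of each label and B roughly 20%; swapping the two assignments gives the 20:80 arrangement asked for in the question.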
That's great!!!
I have a question not related to this one.
I have a column vector, say Labels (1900 x 1 double), containing the labels 3, 2 and 1. I want to rename these numbers: 1 as good, 2 as average and 3 as bad. Can you please help me with this?
Which goes to which? Is category 1 best or worst?
Either way, it's just a linear transformation -- remember "y = mx+b" from HS algebra?
You can calculate this one by hand or if you're lazy use polyfit/polyval.
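For illustration only, assuming the goal were to reverse the numeric order (1 becomes 3 and 3 becomes 1), which the question does not actually ask for, the line can be found either way. The sample Labels vector and the names flipped and flippedByHand are made up for this sketch:

```matlab
Labels = [1;1;2;3;3;2];                  % small made-up label vector

% Let polyfit find m and b of y = m*x + b through (1,3), (2,2), (3,1)
p = polyfit([1 2 3], [3 2 1], 1);
flipped = round(polyval(p, Labels));     % round guards against tiny fit error

% The same line worked out by hand: m = -1, b = 4
flippedByHand = 4 - Labels;
```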
So you want a text label instead?
You could do this to make a 1900 x 1 cell array
labelText = {'good','average','bad'}; % 1 is good, 2 is average, 3 is bad
textLabels = labelText(Labels)'; % need ' to make into a column
1 for good, 2 for average, 3 for bad.
The current label matrix looks like this (1900 x 1 double):
Current_Label_Matrix = [1,1,1,...,1,1,2,2,2,2,...,2,2,2,2,2,2,3,3,3,3,3,3,3,3,...,3,3,3]
In the new matrix I want to replace these numbers with the strings good, average and bad, keeping the same dimensions as Current_Label_Matrix:
For_new_matrix = [good,...,good,average,...,average,...,bad,...,bad]
@Jon This worked. Thank you so much!!!
Oh, if you want categorical labels, then use categorical variables -- that's what they're for...
labels=randi(3,10,1); % dummy dataset for show...
labels=categorical(labels,[1:3],{'Good','Average','Bad'},'ordinal',1); % convert to categorical
labels =
10×1 categorical array
Bad
Good
Bad
Average
Average
Bad
Bad
Good
Bad
Good
>>
Plots are aware of categorical variables so you get the labels automagically; you may have to use
>> categories(labels)
ans =
3×1 cell array
{'Good'   }
{'Average'}
{'Bad'    }
>>
or string or cellstr occasionally to get a string representation if you need it specifically.
But manipulating table data as categorical instead of as string is far easier, and more efficient besides.
While I showed this as a standalone new variable called labels, what you really want to do is convert the actual variable to categorical and use it instead of the original... then the labels come along for free.
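A minimal sketch of that idea, converting the label column in place inside a table. The stand-in data and the variable names (f1 to f3, Label, badRows, atLeastAverage, countsPerLabel) are made up for illustration:

```matlab
rng(1);                                        % fixed seed, only for repeatability
data = [rand(10,3), randi(3,10,1)];            % stand-in data, labels in column 4
T = array2table(data, 'VariableNames', {'f1','f2','f3','Label'});

% Replace the numeric label column by an ordinal categorical with named levels
T.Label = categorical(T.Label, 1:3, {'Good','Average','Bad'}, 'Ordinal', true);

badRows        = T(T.Label == 'Bad', :);       % select rows by name
atLeastAverage = T(T.Label >= 'Average', :);   % ordinal comparison: Average or Bad
countsPerLabel = countcats(T.Label);           % counts in the order Good, Average, Bad
```

Because the column is ordinal, the comparison follows the category order given in the call, not alphabetical order.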
@dpb Thanks I realize I need to get more familiar with categorical variables. From your example, and I think another one I saw recently I see that they provide some powerful capabilities.


More Answers (1)

dpb on 10 May 2022
Edited: dpb on 10 May 2022
[ix,idx]=findgroups(X(:,4)); % get grouping variable on fourth column X
for i=idx.' % for each group ID (must be numeric as here)
    I=find(ix==i); % the indices into X for the group
    N=numel(I); % how many in this group
    I=I(randperm(N)); % rearrange randomly the elements of index vector
    nA=floor(0.8*N); % how many to pick for A (maybe round() instead???)
    iA{i}=I(1:nA); % the randomized selection for A
    iB{i}=I(nA+1:end); % rest for B
end

5 Comments

Jon on 10 May 2022
Edited: Jon on 10 May 2022
This looks similar to what I have posted except to use findgroups, but I don't see where you are accumulating any results. Won't the values of iA and iB you compute in your loop be overwritten with each loop iteration, or am I missing something?
dpb on 10 May 2022
Edited: dpb on 10 May 2022
As is; yes...it simply illustrates the steps.
I had the loop in an arrayfun construct locally, so they were returned as cell arrays automagically. But the anonymous function was complex enough that I figured it was not ideal to post, so I converted it to a conventional loop, adding intermediaries, but didn't add the explicit indices.
Add an {i} to each to save the subscript arrays or "do whatever" with them inside the loop before moving on to the next iteration, user choice.
ADDENDUM:
Made correction to Answer; including fixup for another variable name change missed in converting the anonymous function before...
As noted same functionality/idea as @Jon with only slightly different syntax.
Hi @dpb, thanks for your response. The dimensions of my data changed, so the new data is Combined_Data (7886 x 8 double).
Desired output:
matrix A(xyz x 8 double)
matrix B(uvw x 8 double)
A (containing 20% of Combined_Data rows) should contain [20% from label 3 rows + 20% from label 2 rows + 20% from label 1 rows].
B (containing 80% of Combined_Data rows) should contain [80% from label 3 rows + 80% from label 2 rows + 80% from label 1 rows].
I tried this code but it's not working for me; please help me see where I am going wrong:
[ix,idx]=findgroups(Combined_Data(:,8));
for i=idx
I=find(ix==i);
N=numel(I);
I=randperm(N);
nA=floor(0.8*N);
iA{i}=I(1:nA);
iB{i}=I(nA+1:end);
end
Error: Arrays have incompatible sizes for this operation.
You're missing the ".'" transpose operator on the for loop iterator -- it must be a row vector; passing a column vector results in all three indices being passed at once. I could have made the code more robust by writing
for i=idx(:).'
instead, in which (:) forces a column vector and ".'" turns it into a row.
However, I see I missed an important step in the cleanup from the anonymous function version -- the line
I=randperm(N);
needs to be
I=I(randperm(N));
to rearrange the subset indices to the grouped variables; the randperm(N) call simply generates the right length of vector subscripts in a random order; still need the actual subscripts from the matching operation of finding the ones in the given group.
With those corrections, it should work as is...cleanest would be to copy and paste the actual code instead of retyping; then you also get indenting and comments and all... :)
I did make the above correction in the Answer code...sorry I missed that first time; glad there was another issue that you reposted so had the chance to see it! :)
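Putting the two corrections together, the loop might look like the sketch below. The stand-in data, the preallocation of iA and iB, and the final assembly of A and B are additions for illustration and were not part of the posted code:

```matlab
rng(1);                                            % fixed seed, only for repeatability
Combined_Data = [rand(7886,7), randi(3,7886,1)];   % stand-in for the spreadsheet data

[ix, idx] = findgroups(Combined_Data(:,8));  % ix: group number per row, idx: label values
iA = cell(numel(idx),1);
iB = cell(numel(idx),1);
for i = idx(:).'              % row vector, so i is one label per iteration
    % comparing ix with i relies on the labels being exactly 1,2,3,
    % so that label values and group numbers coincide
    I = find(ix == i);        % rows belonging to this group
    N = numel(I);
    I = I(randperm(N));       % shuffle the actual row numbers
    nA = floor(0.8*N);
    iA{i} = I(1:nA);          % 80% of this label
    iB{i} = I(nA+1:end);      % remaining 20%
end

% collect the rows; swap iA and iB here if A should be the 20% part
A = Combined_Data(vertcat(iA{:}),:);
B = Combined_Data(vertcat(iB{:}),:);
```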
-cleanest would be to copy and paste the actual code instead of retyping; then you also get indenting and comments and all... :)
Yeah, I should have done it that way.
Thanks @dpb for your help!


Products

Release

R2021a

Asked:

on 10 May 2022

Commented:

Jon
on 11 May 2022
