Efficient script to isolate one sub-dataset k-times.

Vic on 3 Mar 2024
Commented: Vic on 7 Mar 2024
Hi everyone,
The idea is to divide the main dataset into k sub-datasets (bins), delete one bin at a time, and re-merge the remaining bins. In a nutshell, k bins produce k different sub-datasets. Since the number of rows in the matrix may not be a multiple of the number of bins (bin k often has fewer rows), I had to use cell arrays.
Here is an illustration of the general idea for k = 2.
Question:
How can I remove the loop or make this code more efficient?
Here is my script.
------------------------------------------------------
Variables = rand(245,57);
Bin_numb = 11;
Bin_size = [1:floor(length(Variables)/Bin_numb):length(Variables) length(Variables)];
for i = 1:length(Bin_size)-1
    if i == 1
        Bin_Variables2{1} = Variables(Bin_size(2):Bin_size(end),:);
    else
        Bin_Variables2{i} = [Variables(Bin_size(1):Bin_size(i)-1,:); Variables(Bin_size(i+1):Bin_size(end),:)];
    end
end
Voss on 5 Mar 2024
Edited: Voss on 5 Mar 2024
Two observations:
1. The last row of Variables is included as the last row of every element of Bin_Variables2 (because Bin_size(end) is always included).
2. When size(Variables,1) is a multiple of Bin_numb, I expect you'd want each element of Bin_Variables2 to be the same size, but that's not what happens.
To illustrate:
Variables = rand(242,7);
Bin_numb = 11;
Bin_size = [1:floor(length(Variables)/Bin_numb):length(Variables) length(Variables)];
for i = 1:length(Bin_size)-1
    if i == 1
        Bin_Variables2{1} = Variables(Bin_size(2):Bin_size(end),:);
    else
        Bin_Variables2{i} = [Variables(Bin_size(1):Bin_size(i)-1,:); Variables(Bin_size(i+1):Bin_size(end),:)];
    end
end
Observation 1: last row always the same:
fprintf('%36s%s\n','Last row of Variables: ',sprintf('%6.4g ',Variables(end,:)));
Last row of Variables: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
for ii = 1:numel(Bin_Variables2)
    fprintf('%36s%s\n',sprintf('Last row of Bin_Variables2{%d}: ',ii),sprintf('%6.4g ',Bin_Variables2{ii}(end,:)));
end
Last row of Bin_Variables2{1}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{2}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{3}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{4}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{5}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{6}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{7}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{8}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{9}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{10}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Last row of Bin_Variables2{11}: 0.02797 0.5595 0.2128 0.4162 0.0364 0.1367 0.6156
Observation 2: unequally sized result matrices even though 242 is a multiple of 11:
bin_sizes = cellfun(@(x)size(x,1),Bin_Variables2)
bin_sizes = 1×11
220 220 220 220 220 220 220 220 220 220 221
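One possible way to avoid both effects, offered only as a sketch and not as the method used elsewhere in this thread, is to give every row a bin label and drop rows by label instead of by edge position. The names `nRows` and `labels` are introduced here for illustration; only base MATLAB functions are used.

```matlab
% Sketch: leave-one-bin-out splits using per-row bin labels.
Variables = rand(242, 7);          % same shape as the illustration above
Bin_numb  = 11;                    % number of bins (k)
nRows     = size(Variables, 1);    % row count, independent of the column count

% Label each row with a bin number 1..Bin_numb. Bins are contiguous and do
% not overlap; when nRows is not a multiple of Bin_numb, bin sizes differ
% by at most one row.
labels = ceil((1:nRows).' * Bin_numb / nRows);

Bin_Variables2 = cell(1, Bin_numb);                  % preallocate the output
for i = 1:Bin_numb
    Bin_Variables2{i} = Variables(labels ~= i, :);   % drop bin i, keep the rest
end

bin_sizes = cellfun(@(x) size(x, 1), Bin_Variables2);   % 220 for every bin here
```

With 242 rows and 11 bins, each bin holds 22 rows, so every element of Bin_Variables2 has 220 rows, and the last row of Variables is excluded from Bin_Variables2{11} because it belongs to bin 11.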
Vic on 7 Mar 2024
@Voss Thanks for these observations. @Manikanta Aditya & @Dyuman Joshi Thanks for your help. I hadn't thought of using a logical array. It is an elegant way to solve this.
Here is my current script.
Variables = rand(245,7);
Bin_numb = 11;
Bin_size = 1:floor(length(Variables)/Bin_numb):length(Variables);
if length(Variables)-Bin_size(end) <= 12
    Bin_size(end) = length(Variables);
end
Bin_Variables2 = cell(1, length(Bin_size)-1);
for i = 1:length(Bin_size)-1
    idx = true(length(Variables), 1);
    idx(Bin_size(i):Bin_size(i+1)) = false;
    Bin_Variables2{i} = Variables(idx, :);
end
for ii = 1:numel(Bin_Variables2)
    fprintf('%1s%s\n',sprintf('Last row {%d}: ',ii),sprintf('%6.4g ',Bin_Variables2{ii}(end,:)));
end
bin_sizes = cellfun(@(x)size(x,1),Bin_Variables2)
length(Variables)-bin_sizes
Bin_size
I forced an if condition that sets Bin_size(end) = length(Variables) if size(Variables,1) is not a multiple of Bin_numb. As a result, the last bin has floor(length(Variables)/Bin_numb) + mod(length(Variables),Bin_numb) rows (22 + 3 = 25), and I get this:
bin_sizes =
222 222 222 222 222 222 222 222 222 222 220
length(Variables)-bin_sizes =
23 23 23 23 23 23 23 23 23 23 25
It works.
As for the last row always being the same: it seems to be fine now, but I still have some doubts about bin N-1 and its size.
Last row {1}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {2}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {3}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {4}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {5}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {6}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {7}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {8}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {9}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {10}: 0.6559 0.4365 0.5963 0.3045 0.6676 0.5343 0.5316
Last row {11}: 0.1865 0.9516 0.07304 0.0887 0.697 0.9751 0.5142
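A quick way to probe the doubt about adjacent bins is to count how many times each row is removed. The sketch below rebuilds the same edges as the script above for 245 rows (the last edge is set unconditionally here, which matches the if branch for this input); `removed_count` and `shared_rows` are names introduced for illustration.

```matlab
% Sketch: count how often each row is removed when bin ranges are inclusive.
nRows    = 245;
Bin_numb = 11;
Bin_size = 1:floor(nRows/Bin_numb):nRows;   % 1, 23, 45, ..., 221, 243
Bin_size(end) = nRows;                      % forced last edge: 243 becomes 245

removed_count = zeros(nRows, 1);
for i = 1:numel(Bin_size)-1
    rows = Bin_size(i):Bin_size(i+1);       % inclusive on both ends
    removed_count(rows) = removed_count(rows) + 1;
end

shared_rows = find(removed_count > 1).';    % 23 45 67 89 111 133 155 177 199 221
```

Because both ends of each range are inclusive, every interior edge belongs to two neighbouring bins, so rows 23, 45, ..., 221 are each removed twice. This is why the first ten bins remove 23 rows instead of 22, and why bin 10 (rows 199 to 221) and bin 11 (rows 221 to 245) share row 221. Using Bin_size(i):Bin_size(i+1)-1 for every bin except the last would be one way to remove the overlap.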

Manikanta Aditya on 4 Mar 2024
Moved: Dyuman Joshi on 4 Mar 2024
Here is a code snippet that I can propose to make the code more efficient by using logical indexing instead of a loop:
Variables = rand(245,57);
Bin_numb = 11;
Bin_size = [1:floor(length(Variables)/Bin_numb):length(Variables) length(Variables)];
Bin_Variables2 = cell(1, length(Bin_size)-1);
for i = 1:length(Bin_size)-1
    idx = true(size(Variables, 1), 1);
    idx(Bin_size(i):Bin_size(i+1)-1) = false;
    Bin_Variables2{i} = Variables(idx, :);
end
In this code, 'idx' is a logical array that is true for the rows of Variables that you want to keep. This approach avoids the need to concatenate arrays, which can be slow in MATLAB because it involves memory allocation. Instead, you’re just creating a logical index and using it to select the rows you want.
Dyuman Joshi on 4 Mar 2024
Edited: Dyuman Joshi on 4 Mar 2024
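@Manikanta Aditya, this looks good, though I would suggest using size(Variables,1) instead of length(Variables): length returns the largest dimension, which equals the row count only when the matrix has at least as many rows as columns. For the row vector Bin_size, numel(Bin_size) is the unambiguous choice, since size(Bin_size,1) returns 1.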
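For reference, here is a small check of how length, size, and numel behave on the kinds of arrays used in this thread. The 45-by-57 matrix `A` is a hypothetical example chosen so that it has more columns than rows.

```matlab
% Sketch: length vs. size vs. numel on a wide matrix and a row vector.
A = rand(45, 57);              % hypothetical matrix: more columns than rows
len_A  = length(A);            % 57: the largest dimension, not the row count
rows_A = size(A, 1);           % 45: always the number of rows

Bin_size = 1:22:245;           % 1x12 row vector of bin edges (1, 23, ..., 243)
n_edges  = numel(Bin_size);    % 12: element count, whatever the orientation
len_v    = length(Bin_size);   % 12: same as numel for a vector
n_rows_v = size(Bin_size, 1);  % 1: a row vector has a single row
```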
" ... by using logical indexing instead of a loop:"
You are still using a loop.
@Vic, an important part of the code above is preallocation, which is good programming practice in MATLAB and improves code performance.
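On the remark about the loop: if the goal is only to avoid writing an explicit for loop, arrayfun can build the cell array in one call. This is a sketch under the assumption that rows are grouped by a per-row bin label (`labels` and `nRows` are names introduced here, not taken from the answer above). arrayfun still iterates internally, so it is not necessarily faster than a preallocated for loop.

```matlab
% Sketch: build the leave-one-bin-out cell array without an explicit for loop.
Variables = rand(245, 57);
Bin_numb  = 11;
nRows     = size(Variables, 1);
labels    = ceil((1:nRows).' * Bin_numb / nRows);   % bin number of every row

% One anonymous-function call per bin; 'UniformOutput', false returns a cell
% array, so the output does not have to be preallocated by hand.
Bin_Variables2 = arrayfun(@(i) Variables(labels ~= i, :), 1:Bin_numb, ...
                          'UniformOutput', false);

removed = nRows - cellfun(@(x) size(x, 1), Bin_Variables2);   % rows dropped per bin
```

Since 245 is not a multiple of 11, the bins hold either 22 or 23 rows, and each row is dropped from exactly one sub-dataset.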
Manikanta Aditya on 4 Mar 2024
Thanks for the reply, @Dyuman Joshi. My bad, I missed the statement about the loop.
