Loop through the unique values of a very large column and extract data


Julian Williams on 15 Jun 2020


Commented: Julian Williams on 16 Jun 2020

This is more a speed question than a "how to" question.

Assume I have the following problem:

Three variables A, B and C.

A is a series of IDs and B and C are data (e.g. dates and a measurement).

For various reasons I want to separate the data, so instead of three columns I have a structure with something like:

mystruct.First_ID_FROM_A = [B(indexFirstID,:) C(indexFirstID,:)]

Traditionally I just do the following:

[uA,IA,IB] = unique(A);
for i=1:length(uA)
    ii = find(i==IB);
    mystruct.(uA{i,1}) = [B(ii,:) C(ii,:)];
    %sometimes I do other stuff here with some cross referencing so the index ii is useful.
end

Job done. I have tried other methods, but this is pretty fast, except now I have crazy big data (e.g. A, B and C are each the best part of a billion rows). So this is my second attempt, which I run on a server:

[uA,IA,IB] = unique(A);
N = length(uA);
temp = cell(N,1);
% do the indexing with a cell structure that can be cut.
parfor i=1:N
    ii = find(i==IB);
    temp{i,1} = [B(ii,:) C(ii,:)];
end
% do a second loop just to reallocate the data
for i=1:N
    mystruct.(uA{i,1}) = temp{i,1};  
end

So despite being two loops this can be quicker as the extraction is in parallel and the assignment is fast.

Is there a fancy way of using something like an array based version of a binary expansion function that can do this faster without the loop, in either step of the second process? Or should I make a C++ and a mex routine to speed this tedious thing up? I think a problem here is the output array is uncertain in terms of size.
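
One loop-free possibility, offered as a sketch rather than a benchmarked answer: find(i==IB) rescans all of IB once per unique ID, so sorting IB a single time and cutting the sorted rows into blocks with mat2cell avoids the repeated scans. The toy A, B and C below are made-up stand-ins for the real data, and the sketch assumes every ID is already a valid field name.

```matlab
% Toy data standing in for the real A, B and C (assumed shapes).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [10; 20; 30; 40; 50];
C = [0.1; 0.2; 0.3; 0.4; 0.5];

[uA, ~, IB] = unique(A);         % IB(k) is the group number of row k
[~, order]  = sort(IB);          % one sort; equal elements keep their original order
counts      = accumarray(IB, 1); % number of rows belonging to each unique ID

% Rows are now contiguous by group, so they can be cut into blocks.
data   = [B(order,:) C(order,:)];
blocks = mat2cell(data, counts, size(data, 2));

% Original row numbers per group, playing the role of ii in the loop.
rowIdx = mat2cell(order, counts, 1);

% One field per unique ID, without an explicit loop.
mystruct = cell2struct(blocks, uA, 1);
```

The reordered copy in data costs extra memory, which may matter at a billion rows. If some IDs are not valid identifiers, matlab.lang.makeValidName(uA) could be applied before cell2struct.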

If so, does anyone have any experience or examples of how to create and map a MATLAB structure in C++ so the output can be read by MATLAB? I use str2doubleq a lot; it takes a cell array of strings and outputs doubles, which is quite vanilla, and I have written a few custom C and C++ routines for fast date and time pulls, when datenum was too slow.

But this is annoying me; I am sure there is a neater way to do it. Once the data is in the structure, it is really fast to just use the fieldnames command and then loop through the sub data objects.
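
For reference, the fieldnames loop mentioned here might look like the following minimal sketch. The two fields and the row count computed per block are hypothetical placeholders for whatever the real per-ID processing is.

```matlab
% Hypothetical structure with one field per ID.
mystruct.id_a = [20 0.2; 50 0.5];
mystruct.id_b = [10 0.1; 30 0.3; 60 0.6];

names = fieldnames(mystruct);       % cell array of field names
nRows = zeros(numel(names), 1);
for k = 1:numel(names)
    block    = mystruct.(names{k}); % dynamic field access to the sub data
    nRows(k) = size(block, 1);      % placeholder for the real per-ID work
end
```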

7 Comments

Sindar on 16 Jun 2020

The point of tables is that they act like a more organized structure array. If you are naming each structure field, you already spend that memory. Depending on the shape of your data, something similar to:

mytable = array2table([B(IB,:) C(IB,:)],'RowNames',num2str(uA))

should work without any loops
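
A sketch of the table route under a slightly different assumption: the IDs are kept as an ordinary grouping variable (table row names have to be unique, so repeated IDs cannot serve as row names). The toy data and the variable names ID, B and C are made up for illustration.

```matlab
% Toy data standing in for the real A, B and C (assumed shapes).
A = {'id_b'; 'id_a'; 'id_b'; 'id_c'; 'id_a'};
B = [10; 20; 30; 40; 50];
C = [0.1; 0.2; 0.3; 0.4; 0.5];

% Keep everything in one table; categorical IDs are compact and fast to compare.
T = table(categorical(A), B, C, 'VariableNames', {'ID', 'B', 'C'});

% Group numbers per row plus the list of unique IDs, without a loop.
[G, ids] = findgroups(T.ID);

% Example per-group computation: mean of C for each ID.
meanC = splitapply(@mean, T.C, G);

% Pull the rows for a single ID by logical indexing.
rowsA = T(T.ID == 'id_a', :);
```

Whether this beats the structure approach at a billion rows is untested here; it mainly saves building one named field per ID.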

Julian Williams on 16 Jun 2020

Benjamin, that is very neat, much appreciated. Sindar, many thanks for the point on the tables.

Answers (0)

Products

MATLAB

Release

R2019b
