Filtering max values within every 3 hours of data

Question

0 votes

upload.mat

Hey everyone,

I attached the first 50 rows of a sample dataset. My time data is in the format of '2000-01-01 01:00' where the last 5 characters hold the hour (HH) and minute (MM) values respectively. I would like to have a code that goes through the time column, and assigns the maximum value of the second column between the 01,02,03 hours to the 03 hour timestep. I need to then have this done for the 04 - 06 time period and so forth until 22 - 00.

At the end the cell array would not need to keep data from other columns and should just be XXXX x 2 cell array and just reassigns the maximums to the 03,06,09 ...... 00 timesteps. There may also be missing hours, so the script would need to accommodate for that.

example:

'1984-01-01 00:00'  10.80  240  20.06  231  0.8
'1984-01-01 01:00'  11.83  230  20.06  6  0.6
'1984-01-01 02:00'  11.31  230  18.51  100  0.6
'1984-01-01 03:00'  10.80  230  18.51  212  0.7
'1984-01-01 04:00'  9.77  240  18.65  348  0.8
'1984-01-01 05:00'  9.25  240  16.97  406  0.8
'1984-01-01 06:00'  9.77  230  17.48  524  0.7
'1984-01-01 07:00'  9.77  230  16.97  612  0.7

would then become:

'1984-01-01 03:00'  11.83  
'1984-01-01 06:00'  9.77

I am completely lost in writing this and any help would be appreciated!

1 Comment
Show -1 older comments Hide -1 older comments

Cedric on 11 May 2014

Well, not all answers below will work with missing data, but if you work for understanding them all (with their limitations), you'll learn a lot about data manipulation in MATLAB ;-)

Sign in to comment.

Sign in to answer this question.

Follow Question

Answer 1

Cedric on 11 May 2014

Edited: Cedric on 13 May 2014

Open in MATLAB Online

2 votes

You could build an approach along the following line:

 % - Build array of time components. 
 dv = datevec( datatoupload(:,1), 'yyyy-mm-dd hh:MM' ) ;
 % - Build logical index of relevant/boundary hours.
 hrId      = rem( dv(:,4), 3 ) == 0 ;
 hrId(end) = true ;                     % Last entry always relevant.
 % - Build vector of (time-) group IDs.
 hrGrId = cumsum( [true; hrId(1:end-1)] ) ;
 % - Get statistics per group.
 grMax = accumarray( hrGrId, cell2mat(datatoupload(:,2)), [], @max ) ;
 % - Build output.
 output = [datatoupload(hrId,1), num2cell(grMax)] ;

Applied to your MAT file, this gives:

 >> output
 output = 
    '1984-01-01 00:00'    [10.8024]
    '1984-01-01 03:00'    [11.8312]
    '1984-01-01 06:00'    [ 9.7736]
    '1984-01-01 09:00'    [ 9.7736]
    '1984-01-01 12:00'    [ 8.7448]
    '1984-01-01 15:00'    [ 8.7448]
    '1984-01-01 18:00'    [ 9.2592]
    '1984-01-01 21:00'    [ 7.2016]
    '1984-01-02 00:00'    [ 5.1440]
    '1984-01-02 03:00'    [ 4.6296]
    '1984-01-02 06:00'    [ 3.6008]
    '1984-01-02 09:00'    [ 2.5720]
    '1984-01-02 12:00'    [ 7.2016]
    '1984-01-02 15:00'    [10.2880]
    '1984-01-02 18:00'    [11.8312]
    '1984-01-02 21:00'    [12.3456]
    '1984-01-03 00:00'    [10.2880]
    '1984-01-03 01:00'    [10.8024]

EDIT 1: the following solution should work with missing data, but you'll have to understand it and fully test it. In the beginning, I am building a test data set with missing data.

 load( 'masao_1.mat' ) ;
 data = datatoupload ;
 % - Remove rows so for day 1:
 %  > Hours 4,5,6 have only data for hours 5,6.
 %  > Hours 10,11,12 have only data for hours 10, 11. 
 %  > Hours 13,14,15 have no data. 
 data([16,15,14,13,5], :) = [] ;
 % - Build array of time components. 
 dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
 % - Build vector of (time-) group IDs.
 %  1. Absolute hour ID.
 hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
 %  2. Contiguous group start pos.
 grStart = [true; diff(floor(hrId/3)) ~= 0] ;             
 %  3. Group start pos.
 grEnd = [grStart(2:end); true] ;
 %  4. Contiguous group IDs.
 grId = cumsum( grStart ) ;
 % - Get statistics per group.
 grMax  = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
 % - Build output.
 output = [data(grEnd,1), num2cell(grMax)] ;

Looking at the output

 >> output
 output = 
    '1984-01-01 00:00'    [10.8024]
    '1984-01-01 03:00'    [11.8312]
    '1984-01-01 06:00'    [ 9.7736]
    '1984-01-01 09:00'    [ 9.7736]
    '1984-01-01 11:00'    [ 8.7448]
    '1984-01-01 18:00'    [ 9.2592]
    '1984-01-01 21:00'    [ 7.2016]
    '1984-01-02 00:00'    [ 5.1440]
    '1984-01-02 03:00'    [ 4.6296]
    '1984-01-02 06:00'    [ 3.6008]
    '1984-01-02 09:00'    [ 2.5720]
    '1984-01-02 12:00'    [ 7.2016]
    '1984-01-02 15:00'    [10.2880]
    '1984-01-02 18:00'    [11.8312]
    '1984-01-02 21:00'    [12.3456]
    '1984-01-03 00:00'    [10.2880]
    '1984-01-03 01:00'    [10.8024]

you see that it seems to be working when missing data hit boundary hours or not, and when they cover full period(s).

EDIT 2: the following solution removes data from groups/periods which don't include 3 entries/measures. Again, you have to fully understand it and test it. You may find an easier way to eliminate entries, I didin't really optimize.

load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:
%  > Hours 4,5,6 have only data for hours 5,6.
%  > Hours 10,11,12 have only data for hours 10, 11. 
%  > Hours 13,14,15 have no data. 
data([16,15,14,13,5], :) = [] ;
% - Build array of time components. 
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
%  1. Absolute hour ID.
hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
%  2. Contiguous group start pos.
grStart = [true; diff(floor(hrId/3)) ~= 0] ;             
%  3. Contiguous group IDs, 1st shot, any size taken into account.
grId = cumsum( grStart ) ;
%  4. Flag data which do not belong to a full group (= 3 entries).
grIsValid    = accumarray( grId, ones(size(grId)) ) == 3 ;
entryIsValid = ismember( grId, find(grIsValid) ) ;
%  5. Eliminate invalid entries, rebuild valid group starts and IDs.
data(~entryIsValid,:)    = [] ;
grStart(~entryIsValid,:) = [] ;
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
grEnd  = [grStart(2:end); true] ;
output = [data(grEnd,1), num2cell(grMax)] ;

With that, we obtain..

 >> output
 output = 
    '1984-01-01 03:00'    [11.8312]
    '1984-01-01 09:00'    [ 9.7736]
    '1984-01-01 18:00'    [ 9.2592]
    '1984-01-01 21:00'    [ 7.2016]
    '1984-01-02 00:00'    [ 5.1440]
    '1984-01-02 03:00'    [ 4.6296]
    '1984-01-02 06:00'    [ 3.6008]
    '1984-01-02 09:00'    [ 2.5720]
    '1984-01-02 12:00'    [ 7.2016]
    '1984-01-02 15:00'    [10.2880]
    '1984-01-02 18:00'    [11.8312]
    '1984-01-02 21:00'    [12.3456]
    '1984-01-03 00:00'    [10.2880]

which don't contain summaries when fewer than 3 entries are available.

8 Comments
Show 6 older comments Hide 6 older comments

mashtine on 13 May 2014

Where can I send your trophy my friend? I am really appreciative of the help this community gives! God knows I would still be in the ice age learning. Many thanks Cedric

Cedric on 13 May 2014

Edited: Cedric on 13 May 2014

Open in MATLAB Online

My pleasure, I'm glad it helped!

You can favor simpler approaches when you don't know how to proceed. Basically, you know very well how to proceed (easily) by hand: you would take and discard the first entries until you find the beginning of a group, check that the group is full, take the max, associate it with the time stamp, and then repeat the process. This can quite easily be implemented with a loop and a few IF statements .. it may not be nice to see, not be too efficient either (even though, loops can be quite efficient nowadays), but you'd make it. The code would look like what follows in the great lines:

 nRows = size( data, 1 ) ;
 times = cell( nRows, 1 ) ;  % Overshoot prealloc => trunc in the end.
 maxes = zeros( nRows, 1 ) ; % Overshoot prealloc => trunc in the end.
 inId  = 0 ;
 outId = 0 ;
 while inId <= nRows
    inId = inId + 1 ;
    if "not a start hour"
       continue ;
    end
    if "next 2 entries not contiguous"
       continue ;
    end
    outId = outId + 1 ;
    % Compute max of current and next 2 entries, store in maxes(outId)
    ...
    % Store time stamp in times{outId}.
    ...
 end
 times = times(1:outId) ;    % Truncate to relevant size.
 maxes = maxes(1:outId) ;    % Truncate to relevant size.

Sign in to comment.

Answer 2

dpb on 11 May 2014

Open in MATLAB Online

1 vote

An alternate approach...

>> dat=circshift(dat(:,1:2),-1);   % place the first of groups of three on first row
>> ddat=cell2mat(dat(:,2));        % convert numeric to array for convenience(+)
>> [~,imx]=max(reshape(ddat(1:fix(length(dat)/3)*3),3,[]));  % find max for each group
>> res=dat(imx+[0:3:(length(imx)-1)*3],1:2)                  % select those from full array
res = 
  '1984-01-01 01:00'    [11.8300]
  '1984-01-01 04:00'    [ 9.7700]
>>

I didn't read Cedric's answer carefully enough to unequivocally answer your question re: missing values; I'd suggest trying it for a sample dataset and see how it works.

For correct operation of the above you will need to fill in missing data "holes" if they do exist similarly as to how Cedric built a date vector from your first to last to ensure the "divide by 3" works to build the right number of columns.

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Answer 3

Andrei Bobrov on 11 May 2014

Edited: Andrei Bobrov on 11 May 2014

Open in MATLAB Online

0 votes

load upload.mat
a = datatoupload;
[~,~,~,hh] = datevec(a(:,1));
hh(hh==0) = 24;
h2 = (rem(hh-1,3)+1)==1;
h2(1) = 1;
ii = cumsum(h2);
[~,b] = unique(ii,'last');
out = [a(b,1) accumarray(ii,cat(1,a{:,2}),[],@(x){max(x)})];

add variant

t = datenum(1984,1,1,3*(0:ceil(diff(datenum(a([1,end],1)))*24/3))',0,0);
t0 =  datenum(a(:,1));
t1=t+eps(1e6);
[~,ii] = histc(t0,t1);
out = [cellstr(datestr(t,'yyyy-mm-dd HH:MM')),...
                            accumarray(ii+1,cat(1,a{:,2}),[],@(x){max(x)})];

0 Comments
Show -2 older comments Hide -2 older comments

Sign in to comment.

Filtering max values within every 3 hours of data

1 Comment
Show -1 older comments Hide -1 older comments

Accepted Answer

8 Comments
Show 6 older comments Hide 6 older comments

More Answers (2)

0 Comments
Show -2 older comments Hide -2 older comments

0 Comments
Show -2 older comments Hide -2 older comments

Categories

Tags

Community Treasure Hunt

Filtering max values within every 3 hours of data

1 Comment Show -1 older comments Hide -1 older comments

Accepted Answer

8 Comments Show 6 older comments Hide 6 older comments

More Answers (2)

0 Comments Show -2 older comments Hide -2 older comments

0 Comments Show -2 older comments Hide -2 older comments

Categories

Tags

See Also

Community Treasure Hunt

1 Comment
Show -1 older comments Hide -1 older comments

8 Comments
Show 6 older comments Hide 6 older comments

0 Comments
Show -2 older comments Hide -2 older comments

0 Comments
Show -2 older comments Hide -2 older comments