Filtering max values within every 3 hours of data

Hey everyone,
I attached the first 50 rows of a sample dataset. My time data is in the format of '2000-01-01 01:00' where the last 5 characters hold the hour (HH) and minute (MM) values respectively. I would like to have a code that goes through the time column, and assigns the maximum value of the second column between the 01,02,03 hours to the 03 hour timestep. I need to then have this done for the 04 - 06 time period and so forth until 22 - 00.
At the end the cell array would not need to keep data from other columns and should just be XXXX x 2 cell array and just reassigns the maximums to the 03,06,09 ...... 00 timesteps. There may also be missing hours, so the script would need to accommodate for that.
example:
'1984-01-01 00:00' 10.80 240 20.06 231 0.8
'1984-01-01 01:00' 11.83 230 20.06 6 0.6
'1984-01-01 02:00' 11.31 230 18.51 100 0.6
'1984-01-01 03:00' 10.80 230 18.51 212 0.7
'1984-01-01 04:00' 9.77 240 18.65 348 0.8
'1984-01-01 05:00' 9.25 240 16.97 406 0.8
'1984-01-01 06:00' 9.77 230 17.48 524 0.7
'1984-01-01 07:00' 9.77 230 16.97 612 0.7
would then become:
'1984-01-01 03:00' 11.83
'1984-01-01 06:00' 9.77
I am completely lost in writing this and any help would be appreciated!

1 Comment

Well, not all answers below will work with missing data, but if you work for understanding them all (with their limitations), you'll learn a lot about data manipulation in MATLAB ;-)

Sign in to comment.

 Accepted Answer

Cedric
Cedric on 11 May 2014
Edited: Cedric on 13 May 2014
You could build an approach along the following line:
% - Build array of time components.
dv = datevec( datatoupload(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build logical index of relevant/boundary hours.
hrId = rem( dv(:,4), 3 ) == 0 ;
hrId(end) = true ; % Last entry always relevant.
% - Build vector of (time-) group IDs.
hrGrId = cumsum( [true; hrId(1:end-1)] ) ;
% - Get statistics per group.
grMax = accumarray( hrGrId, cell2mat(datatoupload(:,2)), [], @max ) ;
% - Build output.
output = [datatoupload(hrId,1), num2cell(grMax)] ;
Applied to your MAT file, this gives:
>> output
output =
'1984-01-01 00:00' [10.8024]
'1984-01-01 03:00' [11.8312]
'1984-01-01 06:00' [ 9.7736]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 12:00' [ 8.7448]
'1984-01-01 15:00' [ 8.7448]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
'1984-01-03 01:00' [10.8024]
EDIT 1: the following solution should work with missing data, but you'll have to understand it and fully test it. In the beginning, I am building a test data set with missing data.
load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:
% > Hours 4,5,6 have only data for hours 5,6.
% > Hours 10,11,12 have only data for hours 10, 11.
% > Hours 13,14,15 have no data.
data([16,15,14,13,5], :) = [] ;
% - Build array of time components.
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
% 1. Absolute hour ID.
hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
% 2. Contiguous group start pos.
grStart = [true; diff(floor(hrId/3)) ~= 0] ;
% 3. Group start pos.
grEnd = [grStart(2:end); true] ;
% 4. Contiguous group IDs.
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
output = [data(grEnd,1), num2cell(grMax)] ;
Looking at the output
>> output
output =
'1984-01-01 00:00' [10.8024]
'1984-01-01 03:00' [11.8312]
'1984-01-01 06:00' [ 9.7736]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 11:00' [ 8.7448]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
'1984-01-03 01:00' [10.8024]
you see that it seems to be working when missing data hit boundary hours or not, and when they cover full period(s).
EDIT 2: the following solution removes data from groups/periods which don't include 3 entries/measures. Again, you have to fully understand it and test it. You may find an easier way to eliminate entries, I didin't really optimize.
load( 'masao_1.mat' ) ;
data = datatoupload ;
% - Remove rows so for day 1:
% > Hours 4,5,6 have only data for hours 5,6.
% > Hours 10,11,12 have only data for hours 10, 11.
% > Hours 13,14,15 have no data.
data([16,15,14,13,5], :) = [] ;
% - Build array of time components.
dv = datevec( data(:,1), 'yyyy-mm-dd hh:MM' ) ;
% - Build vector of (time-) group IDs.
% 1. Absolute hour ID.
hrId = dv(:,4:-1:1) * cumprod( [1; 24; 32; 12] ) - 1 ;
% 2. Contiguous group start pos.
grStart = [true; diff(floor(hrId/3)) ~= 0] ;
% 3. Contiguous group IDs, 1st shot, any size taken into account.
grId = cumsum( grStart ) ;
% 4. Flag data which do not belong to a full group (= 3 entries).
grIsValid = accumarray( grId, ones(size(grId)) ) == 3 ;
entryIsValid = ismember( grId, find(grIsValid) ) ;
% 5. Eliminate invalid entries, rebuild valid group starts and IDs.
data(~entryIsValid,:) = [] ;
grStart(~entryIsValid,:) = [] ;
grId = cumsum( grStart ) ;
% - Get statistics per group.
grMax = accumarray( grId, cell2mat(data(:,2)), [], @max ) ;
% - Build output.
grEnd = [grStart(2:end); true] ;
output = [data(grEnd,1), num2cell(grMax)] ;
With that, we obtain..
>> output
output =
'1984-01-01 03:00' [11.8312]
'1984-01-01 09:00' [ 9.7736]
'1984-01-01 18:00' [ 9.2592]
'1984-01-01 21:00' [ 7.2016]
'1984-01-02 00:00' [ 5.1440]
'1984-01-02 03:00' [ 4.6296]
'1984-01-02 06:00' [ 3.6008]
'1984-01-02 09:00' [ 2.5720]
'1984-01-02 12:00' [ 7.2016]
'1984-01-02 15:00' [10.2880]
'1984-01-02 18:00' [11.8312]
'1984-01-02 21:00' [12.3456]
'1984-01-03 00:00' [10.2880]
which don't contain summaries when fewer than 3 entries are available.

8 Comments

I wish I could accept this twice! Thanks a lot as this seems to really do the trick. Quick question if you don't mind ... will this still work if say some hours are missing? Going through it I am leaning towards yes because the first row only contains one hour from the 22,23,00 group and it used it none the less as it deemed it the 'max'. I just wanted to confirm that if my timeseries went:
00 01 03 04 05 06 07 13 14 15
Would it still function.
Again, I think you got it covered so i'm leaning towards yes.
Cedric
Cedric on 11 May 2014
Edited: Cedric on 11 May 2014
It would not, simply because of the way group IDs are built. If one (or more) of these boundary hours is missing, the cumulative sum won't generate correct group IDs. Give me a few minutes and I will update my answer.
Juts one question, when there are missing boundary hours, you associate the statistics to the last available entry of the group, or you discard the data? E.g. if you have
'1984-01-01 01:00' 11.83 230 20.06 6 0.6
'1984-01-01 02:00' 11.31 230 18.51 100 0.6
but
'1984-01-01 03:00' 10.80 230 18.51 212 0.7
is missing, do you need to get
'1984-01-01 02:00' 11.83 <= max among 1am and 2am values.
or should these 1am and 2am data be discarded?
Ahhh, good question. I am trying to validate the observational data with modelled output. The modelled output however gives the max wind speed over 3 hours where the measure gave max wind speed every hour. Thus, if I take the max of 3 hours, it is the max within 3 hours like the model give. However, an 'ideal' solution would be to only use time period where all 3 slots are available thus to answer the above, ideally, I should eliminate the 03 and just print 00 and 06 and so forth. There isn't a lot of hours missing but I do not want to create bias in the statistics
Hi Cedric,
Any thought of what I last commented? Thanks
Where can I send your trophy my friend? I am really appreciative of the help this community gives! God knows I would still be in the ice age learning. Many thanks Cedric
My pleasure, I'm glad it helped!
You can favor simpler approaches when you don't know how to proceed. Basically, you know very well how to proceed (easily) by hand: you would take and discard the first entries until you find the beginning of a group, check that the group is full, take the max, associate it with the time stamp, and then repeat the process. This can quite easily be implemented with a loop and a few IF statements .. it may not be nice to see, not be too efficient either (even though, loops can be quite efficient nowadays), but you'd make it. The code would look like what follows in the great lines:
nRows = size( data, 1 ) ;
times = cell( nRows, 1 ) ; % Overshoot prealloc => trunc in the end.
maxes = zeros( nRows, 1 ) ; % Overshoot prealloc => trunc in the end.
inId = 0 ;
outId = 0 ;
while inId <= nRows
inId = inId + 1 ;
if "not a start hour"
continue ;
end
if "next 2 entries not contiguous"
continue ;
end
outId = outId + 1 ;
% Compute max of current and next 2 entries, store in maxes(outId)
...
% Store time stamp in times{outId}.
...
end
times = times(1:outId) ; % Truncate to relevant size.
maxes = maxes(1:outId) ; % Truncate to relevant size.

Sign in to comment.

More Answers (2)

An alternate approach...
>> dat=circshift(dat(:,1:2),-1); % place the first of groups of three on first row
>> ddat=cell2mat(dat(:,2)); % convert numeric to array for convenience(+)
>> [~,imx]=max(reshape(ddat(1:fix(length(dat)/3)*3),3,[])); % find max for each group
>> res=dat(imx+[0:3:(length(imx)-1)*3],1:2) % select those from full array
res =
'1984-01-01 01:00' [11.8300]
'1984-01-01 04:00' [ 9.7700]
>>
I didn't read Cedric's answer carefully enough to unequivocally answer your question re: missing values; I'd suggest trying it for a sample dataset and see how it works.
For correct operation of the above you will need to fill in missing data "holes" if they do exist similarly as to how Cedric built a date vector from your first to last to ensure the "divide by 3" works to build the right number of columns.
load upload.mat
a = datatoupload;
[~,~,~,hh] = datevec(a(:,1));
hh(hh==0) = 24;
h2 = (rem(hh-1,3)+1)==1;
h2(1) = 1;
ii = cumsum(h2);
[~,b] = unique(ii,'last');
out = [a(b,1) accumarray(ii,cat(1,a{:,2}),[],@(x){max(x)})];
add variant
t = datenum(1984,1,1,3*(0:ceil(diff(datenum(a([1,end],1)))*24/3))',0,0);
t0 = datenum(a(:,1));
t1=t+eps(1e6);
[~,ii] = histc(t0,t1);
out = [cellstr(datestr(t,'yyyy-mm-dd HH:MM')),...
accumarray(ii+1,cat(1,a{:,2}),[],@(x){max(x)})];

Categories

Find more on Statistics and Machine Learning Toolbox in Help Center and File Exchange

Asked:

on 11 May 2014

Edited:

on 13 May 2014

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!