Fill a timetable outside the gaps
13 views (last 30 days)
Show older comments
Consider the following timetable, which is just a toy to explain the problem
dates = [datetime(2020,1,1) datetime(2020,1,1) datetime(2020,1,2) datetime(2020,1,2) datetime(2020,1,3) datetime(2020,1,3) datetime(2020,1,4)];
t = timetable(dates',["a" "c" "a" "b" "a" "a" "c"]',[2 1 3 2 1 2 1]')
Time Var1 Var2
___________ ____ ____
01-Jan-2020 "a" 2
01-Jan-2020 "c" 1
02-Jan-2020 "a" 3
02-Jan-2020 "b" 2
03-Jan-2020 "a" 1
03-Jan-2020 "a" 2
04-Jan-2020 "c" 1
For each date and for each unique string in Var1 I would like to obtain the sum of the values in Var2. That is:
(1) if for a specific date a string does not appear, then 0 must be returned
(2) if for a specific date a string appears only once, then the corresponding value in Var2 must be returned
(3) if for a specific date a string appears multiple times, then the sum of the corresponding values in Var2 must be returned
The function retime satisfy (2) and (3) completely, but (1) only partially in the sense that it returns 0 only if the string appears in a previous date AND in a following date, that is retime only fills the gaps.
In the following the row with 04-Jan-2020 and 0 is missing
retime(t(t.Var1=="a","Var2"),'daily','sum')
Time Var2
___________ ____
01-Jan-2020 2
02-Jan-2020 3
03-Jan-2020 3
In the following the rows with dates 1, 3, 4 january (and 0 in Var2) are missing
retime(t(t.Var1=="b","Var2"),'daily','sum')
Time Var2
___________ ____
02-Jan-2020 2
The following is correct
retime(t(t.Var1=="c","Var2"),'daily','sum')
Time Var2
___________ ____
01-Jan-2020 1
02-Jan-2020 0
03-Jan-2020 0
04-Jan-2020 1
Is there an easy way to resolve the problem and without using for loops? The data I'm working with has thousands of rows and tens of columns, I tried with loops but it's time consuming.
In the toy example an easy workaround is to manually add the missing strings at the start and at the end of the timetable, in such a way we can extend the gaps to cover all the datetimes and then retime can fill all the missing values.
However, this is easy only for timetables with a simple structure, moreover it might have some unwanted side effects
t = [timetable(datetime(2020,1,1),"b",0) ; t ; timetable([datetime(2020,1,4);datetime(2020,1,4)],["a";"b"],[0;0])]
Time Var1 Var2
___________ ____ ____
01-Jan-2020 "b" 0
01-Jan-2020 "a" 2
01-Jan-2020 "c" 1
02-Jan-2020 "a" 3
02-Jan-2020 "b" 2
03-Jan-2020 "a" 1
03-Jan-2020 "a" 2
04-Jan-2020 "c" 1
04-Jan-2020 "a" 0
04-Jan-2020 "b" 0
0 Comments
Accepted Answer
Siddharth Bhutiya
on 19 May 2021
In your case your grouping variables would be Time and Var1 and the aggregation method you want to use would be sum. Since you want all permutations of the grouping variables to show up in your output, you can specify the "IncludeEmptyGroups" as true and that should give you the desired output.
groupsummary(t,["Time","Var1"],"sum","Var2","IncludeEmptyGroups",true)
ans =
12×4 table
Time Var1 GroupCount sum_Var2
___________ ____ __________ ________
01-Jan-2020 "a" 1 2
01-Jan-2020 "b" 0 0
01-Jan-2020 "c" 1 1
02-Jan-2020 "a" 1 3
02-Jan-2020 "b" 1 2
02-Jan-2020 "c" 0 0
03-Jan-2020 "a" 2 3
03-Jan-2020 "b" 0 0
03-Jan-2020 "c" 0 0
04-Jan-2020 "a" 0 0
04-Jan-2020 "b" 0 0
04-Jan-2020 "c" 1 1
5 Comments
Siddharth Bhutiya
on 20 May 2021
Edited: Siddharth Bhutiya
on 20 May 2021
@giannit When I first read your comment about missing date my thought was well for groupsummary Time is just a grouping variable, so as far as it is concerned there is no difference between 02-Jan-2020 or 5-Jan-2020 or even 20-May-2021, all these are just values for it so what should it consider as a missing value?
But then I thought about it some more and took another look at the problem and it made more sense. Technically 02-Jan-2020 is not missing but in your final output you want the Time variable to be in increments of 1 day starting from min(Time) to the max(Time). This seemed like a perfectly reasonable thing to do when someone is trying to summarize their data, so I went back to the documentation page for groupsummary and found the groupbins argument. So it seems that you can get the desired results by doing the following
T = timetable(datetime(2020,1,[1 1 1 3 3 4])',["a" "c" "b" "a" "a" "c"]',[2 1 2 1 2 1]');
groupsummary(T,["Time","Var1"],["day","none"],"sum","Var2","IncludeEmptyGroups",true)
ans =
12×4 table
day_Time Var1 GroupCount sum_Var2
___________ ____ __________ ________
01-Jan-2020 "a" 1 2
01-Jan-2020 "b" 1 2
01-Jan-2020 "c" 1 1
02-Jan-2020 "a" 0 0
02-Jan-2020 "b" 0 0
02-Jan-2020 "c" 0 0
03-Jan-2020 "a" 2 3
03-Jan-2020 "b" 0 0
03-Jan-2020 "c" 0 0
04-Jan-2020 "a" 0 0
04-Jan-2020 "b" 0 0
04-Jan-2020 "c" 1 1
This method should also give you the correct values for the GroupCounts as opposed to the ad hoc method in the previous comment. Hope this helps !!
More Answers (2)
dpb
on 14 May 2021
Edited: dpb
on 14 May 2021
Illustrate the concept --
tt.Var1=categorical(tt.Var1); % convert Var1 to categorical as rightfully is
dRef=datetime(tt.Time(1):tt.Time(end)).'; % get the full date vector
vRef=unique(tt.Var1); % and the unique values of Var1
% build a reference timetable of all times/characteristics
ttRef=timetable(reshape(repmat(dRef.',size(vRef,1),1),[],1), repmat(vRef,size(dRef,1),1));
ttRef.Var2=zeros(height(ttRef),1); % add Var2 column of zeros
[~,ib]=ismember(tt(:,{'Var1'}),ttRef(:,{'Var1'})); % locate the ones that are present in reference
ttRef.Var2(ib)=tt.Var2; % and replace with start
The above for the example leads to a fully-augmented timetable of:
ttRef =
12×2 timetable
Time Var1 Var2
___________ ____ ____
01-Jan-2020 a 2
01-Jan-2020 b 0
01-Jan-2020 c 1
02-Jan-2020 a 3
02-Jan-2020 b 2
02-Jan-2020 c 0
03-Jan-2020 a 2
03-Jan-2020 b 0
03-Jan-2020 c 0
04-Jan-2020 a 0
04-Jan-2020 b 0
04-Jan-2020 c 1
K>>
for which the previous rowfun solution will work
1 Comment
dpb
on 15 May 2021
NB: The above construction left out the insertion of the duplicated initial rows...remember to reinsert them as well.
dpb
on 14 May 2021
K>> rowfun(@sum,t,"InputVariables",'Var2','GroupingVariables',{'Time','Var1'})
ans =
6×3 timetable
Time Var1 GroupCount Var3
___________ ____ __________ ____
01-Jan-2020 "a" 1.00 2.00
01-Jan-2020 "c" 1.00 1.00
02-Jan-2020 "a" 1.00 3.00
02-Jan-2020 "b" 1.00 2.00
03-Jan-2020 "a" 2.00 3.00
04-Jan-2020 "c" 1.00 1.00
K>>
I'd just fill in the missing dates using ismember or setdiff to create the vector of missing dates as compared to a full vector from the first to last date in the timetable. That's still a set of vector operations on the resulting table.
Is there to be a "0" entry for each value of Var1, too, I suppose?
SOMEWHAT LATER ADDENDUM
I didn't try to write it, but seems you could use retime and custom function to fill the missing entries in the original timetable with zeros for each missing date/category; perhaps in conjunction with rowfun to using the grouping variable to cover the categories.
All of these, of course, have the looping construct in them, just at a lower level...
0 Comments
See Also
Categories
Find more on Tables in Help Center and File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!