Filter or delete data from (time)table depending of missing values

10 views (last 30 days)
Hello dear community (from which I already read a lot - thank you).
I've got datasets (solar irradiation - GHI) for approx. 4 years with a 10mn timestep ; and it's full of holes. I need to get a hourly average of GHI and I want to exclude/delete any "hour" that hasn't 6 GHI values (for 0, 10, 20, 30, 40 and 50 minutes within eah hour, of each day, of each month, of each year).
I joined a sample of one dataset. I already know how to correct some flaws of the file, like to round some datetime value that are for exemple (HH:MM:SS) 10:09:59 instead of 10:10:10 (with dateshift 'start','minute','nearest').
I already tried "retime" by creating a timetable, filled with NaN values for missing GHI but this is not convenient as I also need access to the datetime vector (once filtered) for comparisons with other datasets. Is there anyway to avoid a loop (datasets sizes vary from 200k to 300k rows) ? If not, a hint would be appreciated.
Thank you very much
  10 Comments
Paul MAUGER
Paul MAUGER on 12 Apr 2021
Thanks for the advice. I indeed thought I would gain some speed not keeping tables and array I don't need anymore.
Walter Roberson
Walter Roberson on 13 Apr 2021
Consider for example
bar = [1 2 3];
bar(2)
clear bar
bar(2)
if rand() < 0.5; bar = [4 5 6]; end
bar(2)
The first bar(2), bar is a variable and indexing is to be used.
The second bar(2), bar has been cleared. But look -- bar() is the name of a function, so the second bar(2) invokes the graphics function!
The third bar(2)... bar might have been assigned a value, or it might not have been, so the third bar(2) might be indexing or it might be a function call...
If there had been no clear of bar, then it would not be necessary to re-resolve the name and no need to have to insert the hooks to deal with the possibility that you are dealing with function instead of variable.
In the case where a variable is never going to be referred to again in the same function, then it is not clear whether there is a performance penalty for clearing the variable as opposed to assigning empty to it, but Mathworks has specifically guided to avoid clearing variables inside a loop.

Sign in to comment.

Accepted Answer

Walter Roberson
Walter Roberson on 10 Apr 2021
[Copying answer down from comments]
Snan = @(v)sum(v) + 0/(numel(v)==6);
datasum = rmmissing(retime(data, 'hourly', Snan));
(This was a bit more difficult to develop than might appear from the code.)
Note that this will treat partial hours (beginning / ending of data) as if the rest of the data is missing... which is true from one perspective.
Also note that this will always treat an hour entry as being the beginning of a new group of data. For example, 01:10, 01:20, 01:30, 01:40, 01:50, 02:00 would be treated as 5 entries for 01:xx and 1 entry for 02:xx and both hours would be declared as incomplete. If you want the exact hour to be treated with the previous hour, subtract 1 from the minutes (or seconds)
The 0/(numel(v)==6) is a hack to introduce a nan in the case where there are not exactly 6 elements, without coding a literal nan. Literal nans "pollute" the computation: you cannot, for example, eliminate a nan by saying NaN*0 .
Another way to write the code without that hack would be something like
Select = @(v, cond) v(cond+1);
Snan = @(v) Select([nan, sum(v)], numel(v) == 6);

More Answers (0)

Categories

Find more on Timetables in Help Center and File Exchange

Products


Release

R2020a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!