Collating data from different sources

Question

Stephen on 4 Oct 2023

0
Link

Direct link to this question

https://uk.mathworks.com/matlabcentral/answers/2029259-collating-data-from-different-sources

Edited: Jon on 5 Oct 2023

I have two csv datasets t1 and t2:

t1:                                 t2:
A       B       C       D           A       B       C       D
     5       4.1     "a"         4       3       4.2     "e"
     3       3.8     "g"         7       5       2.5     "c"
     2       3.6     "d"         9       4       4.3     "b"
     NaN     4.2     "e"
     6       4.2     "h"
     2       3.6     "d"
     NaN     2.5     "c"
     5       1.8     "e"
     NaN     4.3     "b"
    6       2.6     "a"

So t1 is missing some values but is the 'master' table. t2 has only the rows with missing values from t1, but with the data filled (from another source).

I am trying to fill the missing values in t1 using the values from t2, where variable 'A' is like an index (having unique values for each observation).

I have tried using outerjoin() but the result is:

t3:
tleft_A     tleft_B     tleft_C     tleft_D     tright_A        tright_B        tright_C        tright_D
           5           4.1         "a"           NaN            NaN              NaN             NaN                            
           3           3.8         "g"           NaN            NaN              NaN             NaN
           2           3.6         "d"           NaN            NaN              NaN             NaN
          NaN          4.2         "e"           4               3               2.5             "e"
           6           4.2         "h"           NaN            NaN              NaN             NaN
           2           3.6         "d"           NaN            NaN              NaN             NaN
          NaN          2.5         "c"           7               5               2.5             "c"
           5           1.8         "e"           NaN            NaN              NaN             NaN
          NaN          4.3         "b"           9               4               4.3             "b"
          6           2.6         "a"           NaN            NaN              NaN             NaN

What I am trying to achieve, however, is:

t3:
A       B       C       D
     5      4.1     "a"
     3      3.8     "g"
     2      3.6     "d"
     3      4.2     "e"
     6      4.2     "h"
     2      3.6     "d"
     5      2.5     "c"
     5      1.8     "e"
     4      4.3     "b"
    6      2.6     "a" 

Is there a simple way to accomplish this, or do I need to extract, index and combine the variables, or approach it programmatically via a loop or something similar? None of the interpolation-type fill methods seem to be appropriate either.

I'm fairly new to MatLab, so if there is any other information I am missing which would help clarify the problem, I'll be happy to update the question with it.

Thanks in advance for any insights you can provide.

2 Comments
Show NoneHide None

Dyuman Joshi on 4 Oct 2023

Please attach the csv files, use the paperclip button to do so.

Stephen on 4 Oct 2023

Thanks for your response Dyuman.

The 'data' in the question is abstracted, the csv files are 1000+ records in length with more fields. It's actually the Titanic dataset from Kaggle. I simply extracted the records with missing data and filled them in with data collected from other expert sources. I was primarily hoping for a generalised strategy, to solve the type of problem the abstracted data exposed, since I anticipate running into the same type of issue with other datasets in the future. I will include the csv files, for your edification, but just figured I must be missing something in terms of strategy.

Sign in to comment.

Sign in to answer this question.

Answer 1

Jon on 4 Oct 2023

0
Link

Direct link to this answer

https://uk.mathworks.com/matlabcentral/answers/2029259-collating-data-from-different-sources#answer_1325909

Open in MATLAB Online

You could use this approach

c1 = {
'A'       'B'       'C'       'D'          
1       5       4.1     "a"         
2       3       3.8     "g"         
3       2       3.6     "d"        
4       NaN     4.2     "e"
5       6       4.2     "h"
6       2       3.6     "d"
7       NaN     2.5     "c"
8       5       1.8     "e"
9       NaN     4.3     "b"
10      6       2.6     "a"};
c2 = {
'A'       'B'       'C'       'D'
 4       3       4.2     "e"
 7       5       2.5     "c"
 9       4       4.3     "b"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t1 = 10×4 table
    A      B      C      D 
    __    ___    ___    ___

     1      5    4.1    "a"
     2      3    3.8    "g"
     3      2    3.6    "d"
     4    NaN    4.2    "e"
     5      6    4.2    "h"
     6      2    3.6    "d"
     7    NaN    2.5    "c"
     8      5    1.8    "e"
     9    NaN    4.3    "b"
    10      6    2.6    "a"
t2 = cell2table(c2(2:end,:)); 
t2.Properties.VariableNames = c2(1,:)
t2 = 3×4 table
    A    B     C      D 
    _    _    ___    ___

    4    3    4.2    "e"
    7    5    2.5    "c"
    9    4    4.3    "b"
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
idxMissing = 3×1
     4
     7
     9
% Fill the values from t2
t3 = t1;
[~,idx2]  = ismember(idxMissing,t2.A) 
idx2 = 3×1
     1
     2
     3
t3(idxMissing,:) = t2(idx2,:)
t3 = 10×4 table
    A     B     C      D 
    __    _    ___    ___

     1    5    4.1    "a"
     2    3    3.8    "g"
     3    2    3.6    "d"
     4    3    4.2    "e"
     5    6    4.2    "h"
     6    2    3.6    "d"
     7    5    2.5    "c"
     8    5    1.8    "e"
     9    4    4.3    "b"
    10    6    2.6    "a"

7 Comments
Show 5 older commentsHide 5 older comments

Jon on 4 Oct 2023

Open in MATLAB Online

Note this will also work if t2 has extra rows that don't correspond to missing data for example

                                 
c1 = {
'A'       'B'       'C'       'D'          
1       5       4.1     "a"         
2       3       3.8     "g"         
3       2       3.6     "d"        
4       NaN     4.2     "e"
5       6       4.2     "h"
6       2       3.6     "d"
7       NaN     2.5     "c"
8       5       1.8     "e"
9       NaN     4.3     "b"
10      6       2.6     "a"};
c2 = {
'A'       'B'       'C'       'D'
 4       3       4.2     "e"
 7       5       2.5     "c"
 8       5       1.8     "e"
 9       4       4.3     "b"
 15      8       6.2     "g"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t1 = 10×4 table
    A      B      C      D 
    __    ___    ___    ___

     1      5    4.1    "a"
     2      3    3.8    "g"
     3      2    3.6    "d"
     4    NaN    4.2    "e"
     5      6    4.2    "h"
     6      2    3.6    "d"
     7    NaN    2.5    "c"
     8      5    1.8    "e"
     9    NaN    4.3    "b"
    10      6    2.6    "a"
t2 = cell2table(c2(2:end,:)); 
t2.Properties.VariableNames = c2(1,:)
t2 = 5×4 table
    A     B     C      D 
    __    _    ___    ___

     4    3    4.2    "e"
     7    5    2.5    "c"
     8    5    1.8    "e"
     9    4    4.3    "b"
    15    8    6.2    "g"
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
idxMissing = 3×1
     4
     7
     9
% Fill the values from t2
t3 = t1;
[~,idx2]  = ismember(idxMissing,t2.A) 
idx2 = 3×1
     1
     2
     4
t3(idxMissing,:) = t2(idx2,:)
t3 = 10×4 table
    A     B     C      D 
    __    _    ___    ___

     1    5    4.1    "a"
     2    3    3.8    "g"
     3    2    3.6    "d"
     4    3    4.2    "e"
     5    6    4.2    "h"
     6    2    3.6    "d"
     7    5    2.5    "c"
     8    5    1.8    "e"
     9    4    4.3    "b"
    10    6    2.6    "a"

Stephen on 5 Oct 2023

I'll give it a little time, and if no one else has a more succinct method, I'll be happy to accept this as the answer. If I come up with something else in my researching, I'll update in a comment here so you can get the notification.

Cheers again for your input Jon

Jon on 5 Oct 2023

Edited: Jon on 5 Oct 2023

Open in MATLAB Online

Here is another approach, that utilizes operations with timetables, rather than indexing. Still a little arcane, but maybe you prefer this. You may want to remove a few more semicolons at end of statements to understand better what each step is doing

                                 
c1 = {
'A'       'B'       'C'       'D'          
1       5       4.1     "a"         
2       3       3.8     "g"         
3       2       3.6     "d"        
4       NaN     4.2     "e"
5       6       4.2     "h"
6       2       3.6     "d"
7       NaN     2.5     "c"
8       5       1.8     "e"
9       NaN     4.3     "b"
10      6       2.6     "a"};
c2 = {
'A'       'B'       'C'       'D'
 4       3       4.2     "e"
 7       5       2.5     "c"
 8       5       1.8     "e"
 9       4       4.3     "b"
 15      8       6.2     "g"};
% Put data into timetables, using A as the time variable
t1 = cell2table(c1(2:end,:),'VariableNames',c1(1,:));
t1.A = days(t1.A); % convert first column to a duration
t1 = table2timetable(t1);
t2 = cell2table(c2(2:end,:),'VariableNames',c2(1,:));
t2.A = days(t2.A); % convert first column to a duration
t2 = table2timetable(t2);
% Combine the tables (makes union of timetables) putting missing data
% wherever data is not found for a given time
t3 = synchronize(t1,t2)
t3 = 11×6 timetable
       A       B_t1    C_t1      D_t1       B_t2    C_t2      D_t2   
    _______    ____    ____    _________    ____    ____    _________

    1 day        5     4.1     "a"          NaN     NaN     <missing>
    2 days       3     3.8     "g"          NaN     NaN     <missing>
    3 days       2     3.6     "d"          NaN     NaN     <missing>
    4 days     NaN     4.2     "e"            3     4.2     "e"      
    5 days       6     4.2     "h"          NaN     NaN     <missing>
    6 days       2     3.6     "d"          NaN     NaN     <missing>
    7 days     NaN     2.5     "c"            5     2.5     "c"      
    8 days       5     1.8     "e"            5     1.8     "e"      
    9 days     NaN     4.3     "b"            4     4.3     "b"      
    10 days      6     2.6     "a"          NaN     NaN     <missing>
    15 days    NaN     NaN     <missing>      8     6.2     "g"      
% Replace the missing values using values from t2
% note by defualt, min omits missing, e.g. NaN
t3.B_t1 = min(t3.B_t1,t3.B_t2);
% Remove columns from t2
t3 = t3(:,1:3);
% Remove any rows that were in t2 but not t1
% These will have missing data from the synchronization
t3 = rmmissing(t3);
% Optionally clean up the variable names
t3.Properties.VariableNames = c1(1,2:end)
t3 = 10×3 timetable
       A       B     C      D 
    _______    _    ___    ___

    1 day      5    4.1    "a"
    2 days     3    3.8    "g"
    3 days     2    3.6    "d"
    4 days     3    4.2    "e"
    5 days     6    4.2    "h"
    6 days     2    3.6    "d"
    7 days     5    2.5    "c"
    8 days     5    1.8    "e"
    9 days     4    4.3    "b"
    10 days    6    2.6    "a"

Sign in to comment.

Collating data from different sources

2 Comments
Show NoneHide None

Answers (1)

7 Comments
Show 5 older commentsHide 5 older comments

See Also

Categories

Tags

Community Treasure Hunt

Collating data from different sources

2 Comments Show NoneHide None

Answers (1)

7 Comments Show 5 older commentsHide 5 older comments

See Also

Categories

Tags

Community Treasure Hunt

2 Comments
Show NoneHide None

7 Comments
Show 5 older commentsHide 5 older comments