Collating data from different sources

4 views (last 30 days)
Stephen
Stephen on 4 Oct 2023
Edited: Jon on 5 Oct 2023
I have two csv datasets t1 and t2:
t1: t2:
A B C D A B C D
1 5 4.1 "a" 4 3 4.2 "e"
2 3 3.8 "g" 7 5 2.5 "c"
3 2 3.6 "d" 9 4 4.3 "b"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"
So t1 is missing some values but is the 'master' table. t2 has only the rows with missing values from t1, but with the data filled (from another source).
I am trying to fill the missing values in t1 using the values from t2, where variable 'A' is like an index (having unique values for each observation).
I have tried using outerjoin() but the result is:
t3:
tleft_A tleft_B tleft_C tleft_D tright_A tright_B tright_C tright_D
1 5 4.1 "a" NaN NaN NaN NaN
2 3 3.8 "g" NaN NaN NaN NaN
3 2 3.6 "d" NaN NaN NaN NaN
4 NaN 4.2 "e" 4 3 2.5 "e"
5 6 4.2 "h" NaN NaN NaN NaN
6 2 3.6 "d" NaN NaN NaN NaN
7 NaN 2.5 "c" 7 5 2.5 "c"
8 5 1.8 "e" NaN NaN NaN NaN
9 NaN 4.3 "b" 9 4 4.3 "b"
10 6 2.6 "a" NaN NaN NaN NaN
What I am trying to achieve, however, is:
t3:
A B C D
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 3 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
10 6 2.6 "a"
Is there a simple way to accomplish this, or do I need to extract, index and combine the variables, or approach it programmatically via a loop or something similar? None of the interpolation-type fill methods seem to be appropriate either.
I'm fairly new to MatLab, so if there is any other information I am missing which would help clarify the problem, I'll be happy to update the question with it.
Thanks in advance for any insights you can provide.
  2 Comments
Dyuman Joshi
Dyuman Joshi on 4 Oct 2023
Please attach the csv files, use the paperclip button to do so.
Stephen
Stephen on 4 Oct 2023
Thanks for your response Dyuman.
The 'data' in the question is abstracted, the csv files are 1000+ records in length with more fields. It's actually the Titanic dataset from Kaggle. I simply extracted the records with missing data and filled them in with data collected from other expert sources. I was primarily hoping for a generalised strategy, to solve the type of problem the abstracted data exposed, since I anticipate running into the same type of issue with other datasets in the future. I will include the csv files, for your edification, but just figured I must be missing something in terms of strategy.

Sign in to comment.

Answers (1)

Jon
Jon on 4 Oct 2023
You could use this approach
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
9 4 4.3 "b"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t1 = 10×4 table
A B C D __ ___ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 NaN 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 NaN 2.5 "c" 8 5 1.8 "e" 9 NaN 4.3 "b" 10 6 2.6 "a"
t2 = cell2table(c2(2:end,:));
t2.Properties.VariableNames = c2(1,:)
t2 = 3×4 table
A B C D _ _ ___ ___ 4 3 4.2 "e" 7 5 2.5 "c" 9 4 4.3 "b"
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
idxMissing = 3×1
4 7 9
% Fill the values from t2
t3 = t1;
[~,idx2] = ismember(idxMissing,t2.A)
idx2 = 3×1
1 2 3
t3(idxMissing,:) = t2(idx2,:)
t3 = 10×4 table
A B C D __ _ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 3 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 5 2.5 "c" 8 5 1.8 "e" 9 4 4.3 "b" 10 6 2.6 "a"
  7 Comments
Stephen
Stephen on 5 Oct 2023
I'll give it a little time, and if no one else has a more succinct method, I'll be happy to accept this as the answer. If I come up with something else in my researching, I'll update in a comment here so you can get the notification.
Cheers again for your input Jon
Jon
Jon on 5 Oct 2023
Edited: Jon on 5 Oct 2023
Here is another approach, that utilizes operations with timetables, rather than indexing. Still a little arcane, but maybe you prefer this. You may want to remove a few more semicolons at end of statements to understand better what each step is doing
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
15 8 6.2 "g"};
% Put data into timetables, using A as the time variable
t1 = cell2table(c1(2:end,:),'VariableNames',c1(1,:));
t1.A = days(t1.A); % convert first column to a duration
t1 = table2timetable(t1);
t2 = cell2table(c2(2:end,:),'VariableNames',c2(1,:));
t2.A = days(t2.A); % convert first column to a duration
t2 = table2timetable(t2);
% Combine the tables (makes union of timetables) putting missing data
% wherever data is not found for a given time
t3 = synchronize(t1,t2)
t3 = 11×6 timetable
A B_t1 C_t1 D_t1 B_t2 C_t2 D_t2 _______ ____ ____ _________ ____ ____ _________ 1 day 5 4.1 "a" NaN NaN <missing> 2 days 3 3.8 "g" NaN NaN <missing> 3 days 2 3.6 "d" NaN NaN <missing> 4 days NaN 4.2 "e" 3 4.2 "e" 5 days 6 4.2 "h" NaN NaN <missing> 6 days 2 3.6 "d" NaN NaN <missing> 7 days NaN 2.5 "c" 5 2.5 "c" 8 days 5 1.8 "e" 5 1.8 "e" 9 days NaN 4.3 "b" 4 4.3 "b" 10 days 6 2.6 "a" NaN NaN <missing> 15 days NaN NaN <missing> 8 6.2 "g"
% Replace the missing values using values from t2
% note by defualt, min omits missing, e.g. NaN
t3.B_t1 = min(t3.B_t1,t3.B_t2);
% Remove columns from t2
t3 = t3(:,1:3);
% Remove any rows that were in t2 but not t1
% These will have missing data from the synchronization
t3 = rmmissing(t3);
% Optionally clean up the variable names
t3.Properties.VariableNames = c1(1,2:end)
t3 = 10×3 timetable
A B C D _______ _ ___ ___ 1 day 5 4.1 "a" 2 days 3 3.8 "g" 3 days 2 3.6 "d" 4 days 3 4.2 "e" 5 days 6 4.2 "h" 6 days 2 3.6 "d" 7 days 5 2.5 "c" 8 days 5 1.8 "e" 9 days 4 4.3 "b" 10 days 6 2.6 "a"

Sign in to comment.

Categories

Find more on Data Type Conversion in Help Center and File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!