I need to optimize this slow code. I am working with three tables that have the same columns but different numbers of rows. Missing data is already accounted for, but the code runs slowly.

Hi,
I am working on importing and comparing large chunks of data shown as tables. I have 3 .txt files which I want to import as tables. The problem is that,to concatenate them, I need to add NaNs to have variables with the same size to perform operations in Matlab (using the rs variable). The three tables are table C which has a size of 4873x15, table D with 5117x15 and table E with 5216 x 15. To import these, I have a bunch of missing data and variable types, so I used the import wizard to generate a function (importpsm) strictly for these sort of tables. Lastly, I want to save these tables so I don't have to run this code every time I need to analyze other variables. I have all the toolboxes available right now.
Here is the code; it is part of a for loop that accesses different txt files belonging to three output files (B). It would be nice to have the tables grow as B increases, but I have not found a way to do this yet.
...for loop starts after I define k as 1, A as 2, and B as 3...
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    for k = 1
        while B == 3
            C = importpsm(txtFiles(k).name);
            D = importpsm(txtFiles(k+1).name);
            E = importpsm(txtFiles(k+2).name);
            rs = array2table(nan(max([size(C) size(D) size(E)]), width(C)), ...
                'VariableNames', C.Properties.VariableNames);
            rs(1:height(C), 1:end) = C;
        end
    end

13 Comments

Those aren't "big" files at all; you should be able to deal with them quite quickly.
Unfortunately, the code generated by the import wizard is generally not particularly "smart"; probably you'd be better off to use the detectImportOptions function on a representative of the file type and then edit it as needed for the fine points. Then, save that import object and use it and readtable directly.
It's not at all clear to me what you mean by the last sentence above.
Attaching part of a sample file would let somebody see the issues you're dealing with, specifically; it's hard to say something that really helps a specific problem from just a generic description.
Thanks for the quick turnaround!
I have a lot more files to analyze than these, I was just wondering what I could do to speed up the reading process for these files in particular. The detectImportOptions is useful, I will redo my code with this and get back to you.
As long as the data type by column is consistent you should have no significant issues.
I'm curious, though...you've got three files each with 15 columns -- are these fifteen different variables for each; hence you concatenating horizontally?
If they're the same variables, and one variable merely distinguishes which file each row came from, then it would probably be far more efficient to do something other than concatenating them into one table -- a cell array or some such, maybe.
Need to know more about what the data format is, what the data really are and what is actual processing to be able to really make any specific recommendations, however, as noted before.
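The cell-array idea above could be sketched roughly like this (folder and file pattern are taken from the code in the question; everything else is illustrative, not your actual workflow):

```matlab
% Keep each replicate as its own table inside a cell array -- no NaN padding needed.
files = dir(fullfile(myFolder, '*PSMs*.txt'));
tbls = cell(numel(files), 1);
for k = 1:numel(files)
    tbls{k} = readtable(fullfile(files(k).folder, files(k).name));
end
% Access replicate 2 as tbls{2}; each table keeps its natural number of rows.
```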
Hey,
The 15 variables are the exact same ones; they are shown separately for 3 replicates of the same experiment using the same parameters. Something is off in the experiment, and my task is to find out what it is; I also want to learn how to do this for a Ph.D. I want to see how I can compare these replicates using histograms and volcano plots. However, I cannot do volcano plots unless these separate variables are the exact same size. Here is the link I was planning on using: https://www.mathworks.com/help/bioinfo/ref/mavolcanoplot.html
Here is where I am stuck:
prompt='Which table do you want to open? 1-MS/MS 2-PSM 3-Peptide Groups 4-Protein Groups 5-Proteins 6-Results ';
A=input(prompt);
prompt2='Number of Trials? ';
B=input(prompt2);
tic
if isempty(A)
warning('No table was selected.')
end
if A == 1
    Table = dir('*MSMSSpectrumInfo*.txt');
    Table = importms(Table(1).name);
    if ~isfile('MS.xls')
        writetable(Table, 'MS.xls', 'WriteVariableNames', true);
    end
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    for k = 1
        while B == 3
            C = detectImportOptions(txtFiles(k).name);
            C.MissingRule = 'fill';
            C.SelectedVariableNames = ...
                {'AnnotatedSequence','x_ProteinGroups','x_Proteins','x_MissedCleavages', ...
                 'Charge','m_z_Da_','MH__Da_','DeltaM_ppm_','Deltam_z_Da_','MSOrder', ...
                 'IonInjectTime_ms_','RT_min_'};
            T = readtable(txtFiles(k).name, C);
            U = readtable(txtFiles(k+1).name, C);
            R = readtable(txtFiles(k+2).name, C);
            rt = nan(max([size(T) size(U) size(R)]), width(T));
            ru = rt;
            rr = rt;
            rt(1:height(T), 1:end) = T;
            ru(1:height(U), 1:end) = U;
            rr(1:height(R), 1:end) = R;
        end
    end
end
The error I get is:
Error using table2array (line 27)
Unable to concatenate the table variables 'AnnotatedSequence' and 'x_ProteinGroups', because their types are cell and double.
I thought I could bypass formatting each column variable by using the import wizard, but your suggestion is faster.
Thanks in advance
Totally out of my field of expertise; never heard of the volcano plot before, but ... does it make any sense to mix the data together to do those plots? I'd think you'd want each experiment plotted separately but perhaps overlaid to view differences, not mixed up together???
To use the detectImportOptions object, I was suggesting you call
C = detectImportOptions(txtFiles(k).name);
for a representative example of each file type (I thought they were all the same format from the initial description?) outside the individual case, and make whatever changes are needed for the variable descriptions it doesn't recognize correctly. (If there's a missing value in the first line of a given file, for example, it will return that column as string rather than numeric, and that isn't the way you really want to import the file; you need an accurate representation of what the file content is defined as.) Then, save that import options object and use it for every file of that type.
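The pattern described above, sketched under the assumption that every file of a given type shares one format (the setvartype correction and the .mat file name are illustrative, not from the actual data):

```matlab
% Build the options object once, from one representative file...
opts = detectImportOptions(txtFiles(1).name);
opts.MissingRule = 'fill';
opts = setvartype(opts, 'AnnotatedSequence', 'char');  % example fix-up of a misdetected column
save('psmOpts.mat', 'opts');   % keep it for later sessions

% ...then reuse it for every file of this type.
for k = 1:numel(txtFiles)
    T = readtable(txtFiles(k).name, opts);
end
```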
Again, it would make things much simpler to try to help specifically if you would attach a sample of the data file(s) that are giving trouble with a description of how want the various variables to be interpreted.
As some comments on coding structure and style...
  • For user input of files, look at menu or dialog for a little more user-friendly
  • Can't you determine the number of trials programmatically given the files to look at?
  • For selecting code sections, this structure would fit a switch statement very well
  • If you factor out each of those areas into smaller functions that can be tested individually, top-level complexity is reduced drastically and testing/debugging is greatly improved
Just some gentle thoughts, suggestions... :)
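The switch suggestion might look something like this, with the case labels following the menu in the question (the per-type helper functions are hypothetical placeholders, one per table type, each testable on its own):

```matlab
switch A
    case 1
        processMSMS();           % hypothetical helper for MS/MS spectrum files
    case 2
        processPSM();            % hypothetical helper for PSM files
    case 3
        processPeptideGroups();  % ...and so on for the remaining menu entries
    otherwise
        warning('No table was selected.')
end
```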
Hi again,
I realize that I haven't posted any sample data; I wanted to try to do most of the legwork myself, but I realize it is a complex way to read/import data when there are no files to analyze. Of the 31 variables, I would like to keep 'Annotated Sequence' along with other variables, but the code takes ages to process 5216 rows of it together with the numerical data. In the end, I chose to import the numerical data separately, and it works like a charm.
  • The dialog input was awesome, I am happy with how it works at the moment.
  • I implemented this to read the number of files found with the extension instead of asking for the number of trials. It works like a charm.
  • I am looking into the switch statement right now; I am sure it will be useful.
  • I am starting out testing each option separately, thanks for your patience
Here's how it changed so far
% User defines which data to visualize and (if there is no Excel file already) save in Excel
list = {'1-MS/MS'; '2-PSM'; '3-Peptide Groups'; '4-Protein Groups'; ...
        '5-Proteins'; '6-Results'};
[A, tf] = listdlg('PromptString', 'Select a file:', 'ListString', list);
tic
if isempty(A)
    warning('No table was selected.')
end
if A == 1
    Table = dir('*MSMSSpectrumInfo*.txt');
    Table = importms(Table(1).name);
    if ~isfile('MS.xls')
        writetable(Table, 'MS.xls', 'WriteVariableNames', true);
    end
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    C = detectImportOptions(txtFiles(1).name);
    C.SelectedVariableNames = { ...
        'x_ProteinGroups','x_Proteins','x_MissedCleavages','Charge','m_z_Da_', ...
        'MH__Da_','DeltaM_ppm_','Deltam_z_Da_','IonInjectTime_ms_','RT_min_'};
    k = 1;
    for B = length(txtFiles)
        fprintf('Analyzing N = %d runs', B)
        T = readtable(txtFiles(k).name, C);
        U = readtable(txtFiles(k+1).name, C);
        R = readtable(txtFiles(k+2).name, C);
    end
    rt = array2table(nan(max([size(T) size(U) size(R)]), width(T)), ...
        'VariableNames', T.Properties.VariableNames);
    ru = rt;
    rr = rt;
    rt(1:height(T), :) = T;
    ru(1:height(U), :) = U;
    rr(1:height(R), :) = R;
    T1 = rt.DeltaM_ppm_;
    T2 = ru.DeltaM_ppm_;
    T3 = rr.DeltaM_ppm_;
    Table = [T; U; R];
    if ~isfile('PSM_Data.xls')
        writetable(Table, 'PSM_Data.xls')
    end
    save Table
The willingness to dig in and tackle the job on your own is most admirable -- you've made good progress and undoubtedly have learned quite a bit along the way! :) Big plus for that!!!
I can see what you've done and can kinda' guess the "why" of how you got there...but (there's always a "but", isn't there??? :) ) it seems to me one could probably avoid quite a bit of those internal machinations as well.
What I guess I'm puzzled at, though, is that you indicated in the beginning files of 15 columns but you end up with 31 variables -- I still don't have a clear picture of the raw data without spending more time trying to reconstruct in detail what must actually be happening from the code than I want to take... :)
I'm still puzzled why there should be any significant problem in processing only 5K lines of any file, but to get to the bottom of that really would require seeing a portion of the actual text input file(s), not the end result.
One thing I notice in the Data table is AnnotatedSequence is a string array -- I wonder if that might not also be a categorical variable or is it, indeed, just comments?
BTW,
for B = length(txtFiles)
probably is not going to do what you want... length returns a single value, so B will be only that one number and the for loop will execute only once, with that particular value for B. And the variables inside are dependent upon k, which is set outside the loop.
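A loop that actually iterates over all of the files, rather than executing once, might read something like this (assuming the C import-options object from the code above; the cell array is illustrative):

```matlab
nTrials = length(txtFiles);          % number of trials found on disk
fprintf('Analyzing N = %d runs\n', nTrials);
tbls = cell(nTrials, 1);
for k = 1:nTrials                    % iterate 1..N, not 'for B = length(...)'
    tbls{k} = readtable(txtFiles(k).name, C);
end
```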
While it may work for the given case, it doesn't seem at all generic; is it true that there are, will be and always will be specifically three sequentially-named files? You can't always rely on the directory order being in a specific order; generally will be returned an alphabetic sorted order, but that is not documented behavior.
Aside: While it isn't an issue here, there's a nuance to length to be aware of -- it returns max(size()) of the argument, so it can surprise sometimes... for example,
>> c=char('Fred','Tom')
c =
2×4 char array
'Fred'
'Tom '
>> length(c)
ans =
4
>>
returns the length of the strings in the character array, not the number of rows...
>> c=char('A','B','C')
c =
3×1 char array
'A'
'B'
'C'
>> length(c)
ans =
3
>>
otoh, returns the number of rows in the array since that's bigger numerically than the length of each character array in the 2D char array.
IOW, "there be dragons!"
Hi,
I actually narrowed things down to 15 columns, which consist mostly of numerical data. Statistical analysis is what I want to do (ttest, volcano plots, and histograms), but the 'Annotated Sequence' is a unique tag for every piece of the puzzle in these experiments. I want to know whether these sequences are conserved if I do the same exact experiment three times, and whether I can find out what is happening by plotting the variables from the columns.
Of the 31 variables available, I want to keep 10 columns (for now), but if I add the Annotated Sequence column, it takes ages for the code to run to add the NaNs so I can have an even number of rows for each matrix. Without running the NaN code to even out the number of rows, I cannot do statistical analysis or volcano plots for that matter. I can settle for indexing the mass (m_z_Da) from the original tables to find the actual annotated sequence. Of the 31 variables, I am interested in these 10 right now:
'x_ProteinGroups','x_Proteins','x_MissedCleavages','Charge','m_z_Da_', ...
'MH__Da_','DeltaM_ppm_','Deltam_z_Da_','IonInjectTime_ms_', 'RT_min_'
So the code I have right now works just fine for that (elapsed time is ~2 seconds). Maybe changing the annotated sequence to a categorical variable might work. I will get back to work in a few minutes :)
Thanks
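The categorical conversion mentioned above could be sketched roughly as follows (table names T and U follow the code earlier in the thread; the intersect line is just one illustration of what categorical storage makes convenient):

```matlab
% Categorical storage de-duplicates repeated sequence strings, which can
% shrink memory use and speed up comparisons between replicates.
T.AnnotatedSequence = categorical(T.AnnotatedSequence);
U.AnnotatedSequence = categorical(U.AnnotatedSequence);

% e.g., which sequences appear in both replicates:
shared = intersect(T.AnnotatedSequence, U.AnnotatedSequence);
```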
I missed this comment earlier so if you've not gone away totally...
"...if I add the Annotated Sequence Column, it takes ages for the code to run to add the NaN's so I can have an even number of rows for each matrix"
I do wish you would attach one of the input files (even a short subsection if they're very long) so we can see what you're dealing with, explicitly. This should not be a hard problem to solve but we can't fix what we can't see...
"Without running the NaN code to even the number of rows, I cannot do statistical analysis or volcano plots for that matter."
I don't follow this reasoning at all -- ttest2 is able to run with different sample sizes to match the actual situation; you don't need to mung on the data for that purpose. Also, you can plot whatever you want with the actual data: plot one, then use hold on and add the other plots. That gives the same result with much less effort than using NaN as an empty placeholder.
We can help if you'll let us... :)
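The two points above, sketched with the variable names from the code earlier in the thread (the column choice is just an example):

```matlab
% ttest2 accepts vectors of different lengths; no NaN padding needed.
[h, p] = ttest2(T.DeltaM_ppm_, U.DeltaM_ppm_);

% Overlaid histograms of the raw, unpadded data:
histogram(T.DeltaM_ppm_); hold on
histogram(U.DeltaM_ppm_);
histogram(R.DeltaM_ppm_); hold off
legend('Trial 1', 'Trial 2', 'Trial 3')
```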
To address your second comment:
I wish to get a p-value for each of the individual rows. I tried ttest2 successfully, but it gives me one overall p-value for all rows.
I chose to use the mattest function because I want the individual p-values. However, it doesn't work if my rows are uneven across trials. When I put the NaN rows in, I get a p-value of zero for every row.
With p-values of zero for every row (using mattest), my volcano plot throws an error because it ignores zeros and nothing can be plotted.
So I am stuck at this point and am waiting for authorization to send the files and see if we can figure this out :(
OK. If you don't want to post files/data in open forum, you can contact me directly thru my link -- I'll be glad to look at them offline and keep any confidentiality needs--I spent 30+ yrs in consulting gig so know that routine well... :)
I need to know more about what we're trying to test as the null hypothesis to understand what p value is needed...and thus how to generate it.


Answers (1)

Hmm... It seems that my txtFiles variable is of type struct, so I guess that is why length is working right now. The experiments are numbered, so it seems like these are placed in order right now. (screenshot attached)

2 Comments

Yes, dir returns a 1D struct array so length does work there--I was just making a side note that the particular function can surprise on occasion.
The order ML returns (on Windows anyway) is alphabetical, yes. Again, I was just noting that while it's that way now and this particular set works, neither is really guaranteed.
As an example of what can go wrong, after running
for i = 1:20
    fid = fopen(num2str(i, 'T%d.txt'), 'w');
    fprintf(fid, '%s\n', 'Record');
    fid = fclose(fid);
end
d = dir('T??.txt');
for i = 1:length(d), disp(d(i).name), end
T1.txt
T10.txt
T11.txt
T12.txt
...
T18.txt
T19.txt
T2.txt
T20.txt
T3.txt
T4.txt
...
which is alphabetic but not natural order... the Current Folder window in ML would show those in natural order, however.
Again, "just sayin'!" :) It's possible your code will work just fine for the limited universe in which it needs to operate; just making you aware that "stuff happens!".
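If the natural (numeric) order ever matters, one defensive sketch using only base MATLAB (the T%d.txt pattern follows the example above; adapt the sscanf format to your actual file names):

```matlab
d = dir('T*.txt');
names = {d.name};
nums = cellfun(@(s) sscanf(s, 'T%d.txt'), names);  % extract the trailing number
[~, order] = sort(nums);
names = names(order);   % now T1, T2, ..., T9, T10, T11, ... regardless of dir's order
```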
Thanks!
Sorry I haven't written back. I appreciate your helpful comments; I know I will use them in the future. Wrangling the data is my actual test to see whether I am the best candidate to analyze data for my lab (I am not part of it yet). I have no problem seeking your assistance (which I am truly grateful for! :) ), but I need to verify whether I can share complete files before actually doing so.
I will get back to you by the end of the day.


Asked: on 21 Nov 2018
Edited: dpb on 26 Nov 2018
