I need to optimize this slow code. I am working with three tables that have the same columns but different numbers of rows. Missing data is already accounted for, but the code runs slowly.

Hi,
I am working on importing and comparing large chunks of data shown as tables. I have 3 .txt files which I want to import as tables. The problem is that,to concatenate them, I need to add NaNs to have variables with the same size to perform operations in Matlab (using the rs variable). The three tables are table C which has a size of 4873x15, table D with 5117x15 and table E with 5216 x 15. To import these, I have a bunch of missing data and variable types, so I used the import wizard to generate a function (importpsm) strictly for these sort of tables. Lastly, I want to save these tables so I don't have to run this code every time I need to analyze other variables. I have all the toolboxes available right now.
Here is the code; it is part of a for loop that accesses different txt files belonging to three output files (B). It would be nice to have the tables grow as B increases, but I have not found a way to do this yet.
...for loop starts after I define k as 1, A as 2, and B as 3...
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    for k = 1
        while B == 3
            C = importpsm(txtFiles(k).name);
            D = importpsm(txtFiles(k+1).name);
            E = importpsm(txtFiles(k+2).name);
            rs = array2table(nan(max([size(C) size(D) size(E)]), width(C)), ...
                'VariableNames', C.Properties.VariableNames);
            rs(1:height(C), 1:end) = C;
        end
    end

13 Comments

Those aren't "big" files at all; you should be able to deal with them quite quickly.
Unfortunately, the code generated by the import wizard is generally not particularly "smart"; probably you'd be better off to use the detectImportOptions function on a representative of the file type and then edit it as needed for the fine points. Then, save that import object and use it and readtable directly.
It's not at all clear to me what you mean by the last sentence above.
Attaching part of a sample file would let somebody see the issues you're dealing with, specifically; it's hard to say something that really helps a specific problem from just a generic description.
Thanks for the quick turnaround!
I have a lot more files to analyze than these, I was just wondering what I could do to speed up the reading process for these files in particular. The detectImportOptions is useful, I will redo my code with this and get back to you.
As long as the data type by column is consistent you should have no significant issues.
I'm curious, though...you've got three files each with 15 columns -- are these fifteen different variables for each; hence you concatenating horizontally?
If they're the same variables, and one variable merely distinguishes which file each row came from, then it would probably be far more efficient to do something other than concatenating them into one table -- a cell array or some such, maybe.
Need to know more about what the data format is, what the data really are and what is actual processing to be able to really make any specific recommendations, however, as noted before.
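The cell-array idea above could be sketched roughly like this (folder and file pattern are taken from the code in the question; everything else is illustrative, not your actual workflow):

```matlab
% Keep each replicate as its own table inside a cell array -- no NaN padding needed.
files = dir(fullfile(myFolder, '*PSMs*.txt'));
tbls = cell(numel(files), 1);
for k = 1:numel(files)
    tbls{k} = readtable(fullfile(files(k).folder, files(k).name));
end
% Access replicate 2 as tbls{2}; each table keeps its natural number of rows.
```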
Hey,
The 15 variables are the exact same ones; they are shown separately for 3 replicates of the same experiment using the same parameters. Something is off in the experiment, and my task is to find out what it is; I also want to learn how to do this for a Ph.D. I want to see how I can compare these replicates using histograms and volcano plots. However, I cannot do volcano plots unless these separate variables are the exact same size. Here is the link I was planning on using: https://www.mathworks.com/help/bioinfo/ref/mavolcanoplot.html
Here is where I am stuck:
prompt='Which table do you want to open? 1-MS/MS 2-PSM 3-Peptide Groups 4-Protein Groups 5-Proteins 6-Results ';
A=input(prompt);
prompt2='Number of Trials? ';
B=input(prompt2);
tic
if isempty(A)
warning('No table was selected.')
end
if A == 1
    Table = dir('*MSMSSpectrumInfo*.txt');
    Table = importms(Table(1).name);
    if ~isfile('MS.xls')
        writetable(Table, 'MS.xls', 'WriteVariableNames', true);
    end
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    for k = 1
        while B == 3
            C = detectImportOptions(txtFiles(k).name);
            C.MissingRule = 'fill';
            C.SelectedVariableNames = ...
                {'AnnotatedSequence','x_ProteinGroups','x_Proteins','x_MissedCleavages', ...
                 'Charge','m_z_Da_','MH__Da_','DeltaM_ppm_','Deltam_z_Da_','MSOrder', ...
                 'IonInjectTime_ms_','RT_min_'};
            T = readtable(txtFiles(k).name, C);
            U = readtable(txtFiles(k+1).name, C);
            R = readtable(txtFiles(k+2).name, C);
            rt = nan(max([size(T) size(U) size(R)]), width(T));
            ru = rt;
            rr = rt;
            rt(1:height(T), 1:end) = T;
            ru(1:height(U), 1:end) = U;
            rr(1:height(R), 1:end) = R;
        end
    end
end
The error I get is:
Error using table2array (line 27)
Unable to concatenate the table variables 'AnnotatedSequence' and 'x_ProteinGroups', because their types are cell and double.
I thought I could bypass formatting each column variable by using the import wizard, but your suggestion is faster.
Thanks in advance
Totally out of my field of expertise; never heard of the volcano plot before, but ... does it make any sense to mix the data together to do those plots? I'd think you'd want each experiment plotted separately but perhaps overlaid to view differences, not mixed up together???
To use the detectImportOptions object, I was suggesting you call
C = detectImportOptions(txtFiles(k).name);
for a representative example of each file type (I thought they were all the same format from the initial description?) outside the individual case, and make whatever changes are needed for the variable descriptions it doesn't recognize correctly. (If there's a missing value in the first line of a given file, for example, it will return that column as string rather than numeric, and that isn't the way you really want to import the file; you need an accurate representation of what the file content is defined as.) Then, save that import options object and use it for every file of that type.
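The pattern described above, sketched under the assumption that every file of a given type shares one format (the setvartype correction and the .mat file name are illustrative, not from the actual data):

```matlab
% Build the options object once, from one representative file...
opts = detectImportOptions(txtFiles(1).name);
opts.MissingRule = 'fill';
opts = setvartype(opts, 'AnnotatedSequence', 'char');  % example fix-up of a misdetected column
save('psmOpts.mat', 'opts');   % keep it for later sessions

% ...then reuse it for every file of this type.
for k = 1:numel(txtFiles)
    T = readtable(txtFiles(k).name, opts);
end
```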
Again, it would make things much simpler to try to help specifically if you would attach a sample of the data file(s) that are giving trouble with a description of how want the various variables to be interpreted.
As some comments on coding structure and style...
  • For user input of files, look at menu or dialog for a little more user-friendly
  • Can't you determine the number of trials programmatically given the files to look at?
  • For selecting code sections, this structure would fit a switch statement very well
  • If you factor out each of those areas into smaller functions that can be tested individually, top-level complexity is reduced drastically and testing/debugging is greatly improved
Just some gentle thoughts, suggestions... :)
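The switch suggestion might look something like this, with the case labels following the menu in the question (the per-type helper functions are hypothetical placeholders, one per table type, each testable on its own):

```matlab
switch A
    case 1
        processMSMS();           % hypothetical helper for MS/MS spectrum files
    case 2
        processPSM();            % hypothetical helper for PSM files
    case 3
        processPeptideGroups();  % ...and so on for the remaining menu entries
    otherwise
        warning('No table was selected.')
end
```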
Hi again,
I realize that I haven't posted any sample data; I wanted to try to do most of the legwork myself, but I realize it is a complex way to read/import data when there are no files to analyze. Of the 31 variables, I would like to keep 'Annotated Sequence' along with other variables, but the code takes ages to process 5216 rows of it together with the numerical data. In the end, I chose to import the numerical data separately, and it works like a charm.
  • The dialog input was awesome, I am happy with how it works at the moment.
  • I implemented this to read the number of files found with the extension instead of asking for the number of trials. It works like a charm.
  • I am looking into the switch statement right now; I am sure it will be useful.
  • I am starting out testing each option separately, thanks for your patience
Here's how it changed so far
% User defines which data to visualize and (if there is no Excel file already) save in Excel
list = {'1-MS/MS'; '2-PSM'; '3-Peptide Groups'; '4-Protein Groups'; ...
        '5-Proteins'; '6-Results'};
[A, tf] = listdlg('PromptString', 'Select a file:', 'ListString', list);
tic
if isempty(A)
    warning('No table was selected.')
end
if A == 1
    Table = dir('*MSMSSpectrumInfo*.txt');
    Table = importms(Table(1).name);
    if ~isfile('MS.xls')
        writetable(Table, 'MS.xls', 'WriteVariableNames', true);
    end
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    C = detectImportOptions(txtFiles(1).name);
    C.SelectedVariableNames = { ...
        'x_ProteinGroups','x_Proteins','x_MissedCleavages','Charge','m_z_Da_', ...
        'MH__Da_','DeltaM_ppm_','Deltam_z_Da_','IonInjectTime_ms_','RT_min_'};
    k = 1;
    for B = length(txtFiles)
        fprintf('Analyzing N = %d runs', B)
        T = readtable(txtFiles(k).name, C);
        U = readtable(txtFiles(k+1).name, C);
        R = readtable(txtFiles(k+2).name, C);
    end
    rt = array2table(nan(max([size(T) size(U) size(R)]), width(T)), ...
        'VariableNames', T.Properties.VariableNames);
    ru = rt;
    rr = rt;
    rt(1:height(T), :) = T;
    ru(1:height(U), :) = U;
    rr(1:height(R), :) = R;
    T1 = rt.DeltaM_ppm_;
    T2 = ru.DeltaM_ppm_;
    T3 = rr.DeltaM_ppm_;
    Table = [T; U; R];
    if ~isfile('PSM_Data.xls')
        writetable(Table, 'PSM_Data.xls')
    end
    save Table
The willingness to dig in and tackle the job on your own is most admirable -- you've made good progress and undoubtedly have learned quite a bit along the way! :) Big plus for that!!!
I can see what you've done and can kinda' guess the "why" of how you got there...but (there's always a "but", isn't there??? :) ) it seems to me one could probably avoid quite a bit of those internal machinations as well.
What I guess I'm puzzled at, though, is that you indicated in the beginning files of 15 columns but you end up with 31 variables -- I still don't have a clear picture of the raw data without spending more time trying to reconstruct in detail what must actually be happening from the code than I want to take... :)
I'm still puzzled why there should be any significant problem in processing only 5K lines of any file, but to get to the bottom of that really would require seeing a portion of the actual text input file(s), not the end result.
One thing I notice in the Data table is AnnotatedSequence is a string array -- I wonder if that might not also be a categorical variable or is it, indeed, just comments?
BTW,
for B = length(txtFiles)
probably is not going to do what you want... length returns a single value, so B will be only that one number and the for loop will execute only once, with that particular value for B. And the variables inside are dependent upon k, which is set outside the loop.
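A loop that actually iterates over all of the files, rather than executing once, might read something like this (assuming the C import-options object from the code above; the cell array is illustrative):

```matlab
nTrials = length(txtFiles);          % number of trials found on disk
fprintf('Analyzing N = %d runs\n', nTrials);
tbls = cell(nTrials, 1);
for k = 1:nTrials                    % iterate 1..N, not 'for B = length(...)'
    tbls{k} = readtable(txtFiles(k).name, C);
end
```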
While it may work for the given case, it doesn't seem at all generic; is it true that there are, will be and always will be specifically three sequentially-named files? You can't always rely on the directory order being in a specific order; generally will be returned an alphabetic sorted order, but that is not documented behavior.
Aside: While it isn't an issue here, there's a nuance to length to be aware of -- it returns max(size()) of the argument, so it can surprise sometimes... for example,
>> c=char('Fred','Tom')
c =
2×4 char array
'Fred'
'Tom '
>> length(c)
ans =
4
>>
returns the length of the strings in the character array, not the number of rows...
>> c=char('A','B','C')
c =
3×1 char array
'A'
'B'
'C'
>> length(c)
ans =
3
>>
otoh, returns the number of rows in the array since that's bigger numerically than the length of each character array in the 2D char array.
IOW, "there be dragons!"
Hi,
I actually narrowed things down to 15 columns, which consist mostly of numerical data. Statistical analysis is what I want to do (ttest, volcano plots, and histograms), but the 'Annotated Sequence' is a unique tag for every piece of the puzzle in these experiments. I want to know whether these sequences are conserved if I do the same exact experiment three times, and whether I can find out what is happening by plotting the variables from the columns.
Of the 31 variables available, I want to keep 10 columns (for now), but if I add the Annotated Sequence column, it takes ages for the code to run to add the NaNs so I can have an even number of rows for each matrix. Without running the NaN code to even out the number of rows, I cannot do statistical analysis or volcano plots for that matter. I can settle for indexing the mass (m_z_Da) from the original tables to find the actual annotated sequence. Of the 31 variables, I am interested in these 10 right now:
'x_ProteinGroups','x_Proteins','x_MissedCleavages','Charge','m_z_Da_', ...
'MH__Da_','DeltaM_ppm_','Deltam_z_Da_','IonInjectTime_ms_', 'RT_min_'
So the code I have right now works just fine for that (elapsed time is ~2 seconds). Maybe changing the annotated sequence to a categorical variable might work. I will get back to work in a few minutes :)
Thanks
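The categorical conversion mentioned above could be sketched roughly as follows (table names T and U follow the code earlier in the thread; the intersect line is just one illustration of what categorical storage makes convenient):

```matlab
% Categorical storage de-duplicates repeated sequence strings, which can
% shrink memory use and speed up comparisons between replicates.
T.AnnotatedSequence = categorical(T.AnnotatedSequence);
U.AnnotatedSequence = categorical(U.AnnotatedSequence);

% e.g., which sequences appear in both replicates:
shared = intersect(T.AnnotatedSequence, U.AnnotatedSequence);
```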
I missed this comment earlier so if you've not gone away totally...
"...if I add the Annotated Sequence Column, it takes ages for the code to run to add the NaN's so I can have an even number of rows for each matrix"
I do wish you would attach one of the input files (even a short subsection if they're very long) so we can see what you're dealing with, explicitly. This should not be a hard problem to solve but we can't fix what we can't see...
"Without running the NaN code to even the number of rows, I cannot do statistical analysis or volcano plots for that matter."
I don't follow this reasoning at all -- ttest2 is able to run with different sample sizes to match the actual situation; you don't need to mung on the data for that purpose. Also, you can plot whatever you want with the actual data: plot one, then use hold on and add the other plots. That gives the same result with much less effort than using NaN as an empty placeholder.
We can help if you'll let us... :)
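The two points above, sketched with the variable names from the code earlier in the thread (the column choice is just an example):

```matlab
% ttest2 accepts vectors of different lengths; no NaN padding needed.
[h, p] = ttest2(T.DeltaM_ppm_, U.DeltaM_ppm_);

% Overlaid histograms of the raw, unpadded data:
histogram(T.DeltaM_ppm_); hold on
histogram(U.DeltaM_ppm_);
histogram(R.DeltaM_ppm_); hold off
legend('Trial 1', 'Trial 2', 'Trial 3')
```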
To address your second comment:
I wish to get a p-value for each of the individual rows. I tried ttest2 successfully, but it gives me one overall p-value for all rows.
I chose to use the mattest function because I want the individual p-values. However, it doesn't work if my rows are uneven across trials. When I put the NaN rows in, I get a p-value of zero for every row.
With p-values of zero for every row (using mattest), my volcano plot throws an error because it ignores zeros and nothing can be plotted.
So I am stuck at this point and am waiting for authorization to send the files and see if we can figure this out :(
OK. If you don't want to post files/data in open forum, you can contact me directly thru my link -- I'll be glad to look at them offline and keep any confidentiality needs--I spent 30+ yrs in consulting gig so know that routine well... :)
I need to know more about what we're trying to test as the null hypothesis to understand what p value is needed...and thus how to generate it.


Answers (1)

Hmm... It seems that my txtFiles variable is of type struct, so I guess that is why length is working right now. The experiments are numbered, so it seems like these are placed in order right now. (screenshot attached)

2 Comments

Yes, dir returns a 1D struct array so length does work there--I was just making a side note that the particular function can surprise on occasion.
The order ML returns (on Windows anyway) is alphabetical, yes. Again, I was just noting that while it's that way now and this particular set works, neither is really guaranteed.
As an example of what can go wrong, after running
for i = 1:20
    fid = fopen(num2str(i, 'T%d.txt'), 'w');
    fprintf(fid, '%s\n', 'Record');
    fid = fclose(fid);
end
d = dir('T??.txt');
for i = 1:length(d), disp(d(i).name), end
T1.txt
T10.txt
T11.txt
T12.txt
...
T18.txt
T19.txt
T2.txt
T20.txt
T3.txt
T4.txt
...
which is alphabetic but not natural order... the Current Folder window in ML would show those in natural order, however.
Again, "just sayin'!" :) It's possible your code will work just fine for the limited universe in which it needs to operate; just making you aware that "stuff happens!".
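If the natural (numeric) order ever matters, one defensive sketch using only base MATLAB (the T%d.txt pattern follows the example above; adapt the sscanf format to your actual file names):

```matlab
d = dir('T*.txt');
names = {d.name};
nums = cellfun(@(s) sscanf(s, 'T%d.txt'), names);  % extract the trailing number
[~, order] = sort(nums);
names = names(order);   % now T1, T2, ..., T9, T10, T11, ... regardless of dir's order
```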
Thanks!
Sorry I haven't written back. I appreciate your helpful comments; I know I will use them in the future. Wrangling the data is my actual test to see whether I am the best candidate to analyze data for my lab (I am not part of it yet). I have no problem seeking your assistance (which I am truly grateful for! :) ), but I need to verify whether I can share complete files before actually doing so.
I will get back to you by the end of the day.


Asked: on 21 Nov 2018
Edited: dpb on 26 Nov 2018
