I need to optimize this slow code. I am working with three tables that have the same columns but different numbers of rows. Missing data is already accounted for, but the code runs slowly.
Hi,
I am working on importing and comparing large chunks of data stored as tables. I have 3 .txt files which I want to import as tables. The problem is that, to concatenate them, I need to pad with NaNs so the variables have the same size before performing operations in MATLAB (using the rs variable). The three tables are table C with a size of 4873x15, table D with 5117x15, and table E with 5216x15. Because the files contain a bunch of missing data and mixed variable types, I used the import wizard to generate a function (importpsm) specifically for this sort of table. Lastly, I want to save these tables so I don't have to run this code every time I need to analyze other variables. I have all the toolboxes available right now.
Here is the code. It is part of a for loop that accesses different txt files belonging to three output files (B). It would be nice to have the tables grow as B increases, but I have not found a way to do this yet.
...the for loop starts after I define k as 1, A as 2, and B as 3...
elseif A == 2
    filePattern = fullfile(myFolder, '*PSMs*.txt');
    txtFiles = dir(filePattern);
    for k = 1
        while B == 3
            C = importpsm(txtFiles(k).name);
            D = importpsm(txtFiles(k+1).name);
            E = importpsm(txtFiles(k+2).name);
            rs = array2table(nan(max([size(C) size(D) size(E)]), width(C)), ...
                'VariableNames', C.Properties.VariableNames);
            rs(1:height(C), 1:end) = C;
        end
    end
13 Comments
dpb
on 21 Nov 2018
Those aren't "big" files at all; you should be able to deal with them quite quickly.
Unfortunately, the code generated by the import wizard is generally not particularly "smart"; probably you'd be better off to use the detectImportOptions function on a representative of the file type and then edit it as needed for the fine points. Then, save that import object and use it and readtable directly.
It's not at all clear to me what you mean by the last sentence above?
Attaching part of a sample file would let somebody see the issues you're dealing with, specifically; it's hard to say something that really helps a specific problem from just a generic description.
Amanda Figueroa
on 21 Nov 2018
dpb
on 22 Nov 2018
As long as the data type by column is consistent you should have no significant issues.
I'm curious, though...you've got three files each with 15 columns -- are these fifteen different variables for each; hence you concatenating horizontally?
If they're the same variables but one variable is different by which to segregate them is the reason, then it would probably be far more efficient to do something other than concatenate them into one table -- a cell array or somesuch, maybe.
Need to know more about what the data format is, what the data really are and what is actual processing to be able to really make any specific recommendations, however, as noted before.
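To illustrate the cell-array alternative mentioned above: each file's table can be kept as its own element, so tables of different heights never need NaN padding. This is only a sketch, reusing the poster's myFolder, file pattern, and importpsm function as given.
    % Read each PSMs file into its own table and keep them in a cell array;
    % no NaN padding is needed because each table keeps its own height.
    txtFiles = dir(fullfile(myFolder, '*PSMs*.txt'));
    tables = cell(numel(txtFiles), 1);
    for k = 1:numel(txtFiles)
        tables{k} = importpsm(txtFiles(k).name);
    end
    % Access one experiment's table as tables{1}, tables{2}, ...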
Amanda Figueroa
on 22 Nov 2018
Edited: dpb
on 22 Nov 2018
dpb
on 22 Nov 2018
Totally out of my field of expertise; never heard of the volcano plot before, but ... does it make any sense to mix the data together to do those plots? I'd think you'd want each experiment plotted separately but perhaps overlaid to view differences, not mixed up together???
To use the detectImportOptions object, I was suggesting you call
C=detectImportOptions(txtFiles(k).name);
for a representative example of each file type (I thought they were all the same format from the initial description?) outside the individual case, and make whatever changes in it for the variable descriptions that it doesn't manage to recognize correctly. (If there's a missing value in the first line for a given file, for example, it will return that column as string rather than numeric, and that isn't the way you really want to import the file; you need an accurate representation of what the file content is defined as.) Then, save that import options object and use it for every file of that type.
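A minimal sketch of that workflow, assuming the AnnotatedSequence column mentioned later in the thread is one of the columns needing a type fix:
    % Build the import options once from a representative file, correct any
    % column types the autodetection misses, then reuse for every file.
    opts = detectImportOptions(txtFiles(1).name);
    opts = setvartype(opts, 'AnnotatedSequence', 'string');  % example tweak
    T = readtable(txtFiles(k).name, opts);                   % reuse opts per file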
Again, it would make things much simpler to try to help specifically if you would attach a sample of the data file(s) that are giving trouble with a description of how want the various variables to be interpreted.
dpb
on 22 Nov 2018
As some comments on coding structure and style...
- For user input of files, look at menu or dialog for a little more user-friendly
- Can't you determine the number of trials programmatically given the files to look at?
- For selecting code sections, this structure would fit a switch structure very well
- If you factor out each of those areas into smaller functions that can be tested individually, top-level complexity is reduced drastically and testing/debugging is greatly improved
Just some gentle thoughts, suggestions... :)
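For instance, the A == 1/2/... branching could read as a switch; the case values and comments here are just illustrative:
    switch A
        case 1
            % ... handle the first file type ...
        case 2
            filePattern = fullfile(myFolder, '*PSMs*.txt');
            txtFiles = dir(filePattern);
            % ... import and process the PSMs tables ...
        otherwise
            error('Unexpected value of A: %d', A)
    end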
Amanda Figueroa
on 23 Nov 2018
Edited: dpb
on 23 Nov 2018
dpb
on 23 Nov 2018
The willingness to dig in and tackle the job on your own is most admirable -- you've made good progress and undoubtedly have learned quite a bit along the way! :) Big plus for that!!!
I can see what you've done and can kinda' guess the "why" of how you got there...but (there's always a "but", isn't there??? :) ) it seems to me one could probably avoid quite a bit of those internal machinations as well.
What I guess I'm puzzled at, though, is that you indicated at the beginning that the files have 15 columns, but you end up with 31 variables -- I still don't have a clear picture of the raw data without spending more time recreating what must actually be happening from the code than I want to take... :)
I'm still puzzled why there should be any significant problem in processing only 5K lines of any file, but to get to the bottom of that really would require seeing a portion of the actual text input file(s), not the end result.
One thing I notice in the Data table is AnnotatedSequence is a string array -- I wonder if that might not also be a categorical variable or is it, indeed, just comments?
BTW,
for B = length(txtFiles)
probably is not going to do what you want... length returns a single value, so B will be only that one number and the for loop will execute only once, with that particular value for B. And the variables inside are dependent upon k, which is set outside the loop.
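The corrected form of the loop being described would iterate over every file; a sketch, reusing the poster's importpsm:
    for B = 1:length(txtFiles)      % iterate over every file, not just one value
        T = importpsm(txtFiles(B).name);
        % ... process table T ...
    end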
While it may work for the given case, it doesn't seem at all generic; is it true that there are, will be and always will be specifically three sequentially-named files? You can't always rely on the directory order being in a specific order; generally will be returned an alphabetic sorted order, but that is not documented behavior.
Aside: While it isn't an issue here, there's a nuance to length to be aware of -- it returns max(size()) for the argument, so it can surprise sometimes... for example,
>> c=char('Fred','Tom')
c =
2×4 char array
'Fred'
'Tom '
>> length(c)
ans =
4
>>
returns the length of the strings in the character array, not the number of rows...
>> c=char('A','B','C')
c =
3×1 char array
'A'
'B'
'C'
>> length(c)
ans =
3
>>
otoh, returns the number of rows in the array since that's bigger numerically than the length of each character array in the 2D char array.
IOW, "there be dragons!"
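When the row count is specifically what's wanted, size with an explicit dimension avoids the surprise entirely:
    c = char('Fred','Tom');   % 2x4 char array
    length(c)                 % 4 -- max(size(c)), the padded string length
    size(c, 1)                % 2 -- the number of rows, unambiguously
    numel(c)                  % 8 -- total elements, also not the row count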
Amanda Figueroa
on 23 Nov 2018
dpb
on 26 Nov 2018
I missed this comment earlier so if you've not gone away totally...
"...if I add the Annotated Sequence Column, it takes ages for the code to run to add the NaN's so I can have an even number of rows for each matrix"
I do wish you would attach one of the input files (even a short subsection if they're very long) so we can see what you're dealing with, explicitly. This should not be a hard problem to solve but we can't fix what we can't see...
"Without running the NaN code to even the number of rows, I cannot do statistical analysis or volcano plots for that matter."
I don't follow this reasoning at all -- ttest2 is able to run with different sample sizes to match the actual situation; you don't need to mung on the data for that purpose. Also, you can plot whatever you want with the actual data, plotting one, then using hold on and adding the other plots is the same thing with much less effort as using the NaN as empty placeholder.
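A sketch of both points; the FoldChange column name here is hypothetical, standing in for whichever variable is being compared between the two unpadded tables:
    % ttest2 accepts samples of different lengths directly -- no NaN padding.
    x = C.FoldChange;            % hypothetical column from the first table
    y = D.FoldChange;            % same column from the second table
    [h, p] = ttest2(x, y);       % two-sample t-test with unequal n

    % Overlay plots from the raw, unpadded data instead of one merged table.
    plot(x, 'o'); hold on
    plot(y, 's'); hold off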
We can help if you'll let us... :)
Amanda Figueroa
on 26 Nov 2018
OK. If you don't want to post files/data in open forum, you can contact me directly thru my link -- I'll be glad to look at them offline and keep any confidentiality needs--I spent 30+ yrs in consulting gig so know that routine well... :)
I need to know more about what we're trying to test as the null hypothesis to understand what p value is needed...and thus how to generate it.
Answers (1)
Amanda Figueroa
on 23 Nov 2018
2 Comments
dpb
on 24 Nov 2018
Yes, dir returns a 1D struct array so length does work there--I was just making a side note that the particular function can surprise on occasion.
The order ML returns (on Windows anyway) is alphabetical, yes. Again, I was just noting that while it's that way now and this particular set works, neither is really guaranteed.
As an example of what can go wrong, after running
for i = 1:20
    fid = fopen(num2str(i,'T%d.txt'),'w');
    fprintf(fid,'%s\n','Record');
    fid = fclose(fid);
end
d = dir('T??.txt');
for i = 1:length(d), disp(d(i).name), end
T1.txt
T10.txt
T11.txt
T12.txt
...
T18.txt
T19.txt
T2.txt
T20.txt
T3.txt
T4.txt
...
which is alphabetic but not natural order... the Current Folder window in ML would show those in natural order, however.
Again, "just sayin'!" :) It's possible your code will work just fine for the limited universe in which it needs to operate; just making you aware that "stuff happens!".
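One defensive pattern when the order matters is to sort on the numeric part of the name yourself; a sketch, assuming names like T1.txt ... T20.txt:
    d = dir('T*.txt');
    names = {d.name};
    nums = cellfun(@(s) sscanf(s, 'T%d.txt'), names);  % pull out the number
    [~, order] = sort(nums);
    names = names(order);       % natural order: T1, T2, ..., T10, T11, ...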
Amanda Figueroa
on 26 Nov 2018