Fastest way to search files by pattern name

I have a main folder with thousands of subfolders. I want to load files only from specific subfolders, which can be identified by a pattern in the subfolder name. Each of those subfolders contains tens of sub-subfolders, and again I only need specific ones, which can also be identified by a pattern in the name. To extract the files I need, I have implemented two approaches using the dir function: 1) a single call with the whole path, including the subfolder and sub-subfolder patterns; 2) first finding all matching subfolders, then searching for the sub-subfolders in a for loop over those subfolders. It turns out that the latter is much faster. Could you explain why?
% first way
files = dir(fullfile(main_folder,'*_data/*_file_to_load/file1.mat'));
% second way
subfolders = dir(fullfile(main_folder,'*_data/'));
files = cell(1,numel(subfolders));
for i = 1:numel(subfolders)
    files{i} = dir(fullfile(subfolders(i).folder,subfolders(i).name,'*_file_to_load/file1.mat'));
end

6 Comments

I don't have the actual answer, but you are aware that you're overwriting the result every iteration?
I wouldn't trust the timing of any code where significant amounts of text are written to the console.
@Rik, thanks for pointing out the overwriting. That was more like pseudocode, so as not to include the particulars; I have edited the code. I didn't quite get what you mean by writing text to the console?
I didn't notice the details on first reading; @Rik is it possible there weren't semicolons at the ends of the code lines? It wouldn't be an issue with the presently shown code.
I was indeed referring to the lack of semicolons in the original code.
I would not trust timings on this:
for n=1:100
    x = rand
end
Instead, you should be timing this:
for n=1:100
    x = rand; % no output to command window
end
@Rik, yes, I got it. However, timings are counted properly (semicolons are present)
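One way to take console output out of the picture entirely is to time a function handle with timeit. The following is only a minimal sketch: the throwaway folder stands in for main_folder and is an assumption for illustration, and the pattern is copied from the question.

```matlab
% Hypothetical stand-in for main_folder: an empty throwaway folder.
main_folder = tempname;
mkdir(main_folder);

% Wrap the call in a function handle; nothing is echoed to the Command Window.
firstWay = @() dir(fullfile(main_folder,'*_data','*_file_to_load','file1.mat'));

% timeit calls the handle several times and returns a typical run time in seconds.
tFirst = timeit(firstWay);
fprintf('first way: %.4g s\n', tFirst);

rmdir(main_folder);   % clean up the empty folder
```

The loop version can be timed the same way by moving the loop into a function and passing a handle to it.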
@Anton Baranikov did you overlook the Answer below in the official Answer section of the page? Did you only see the comments up here at the top where people are not giving answers but are asking for clarification of the question? If you saw my Answer below, then explain why it doesn't work, or let me know that it did work.

 Accepted Answer

dpb on 17 Apr 2023
Edited: dpb on 17 Apr 2023
As for the original question, the difference comes from how the underlying OS processes the dir request. When you ask for a listing through a chain of subdirectories starting from a higher level, the directories aren't necessarily stored on disk in the order in which they appear, so dir has to traverse the directory structure from the top all the way to the bottom. It also doesn't know where the matches may stop, so it has to examine everything possibly reachable from the topmost location.
In the second case, you give it a starting point underneath a specific folder, and the chain from there to the bottom is undoubtedly only one level deep. It simply does far less work in the second case than it must in the first.
The fastest way is to keep each search as shallow as your a priori knowledge of the structure allows. Several shallow searches will virtually always beat one deep one.
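The shallow, staged search described here could be sketched roughly as follows. The folder names and the throwaway tree are hypothetical and exist only so the example runs on its own; the sketch also assumes a MATLAB release whose dir accepts the * wildcard in folder names, as the code in the question does.

```matlab
% Build a tiny throwaway tree (hypothetical names) so the sketch is runnable.
root = tempname;
names = {'a_data','b_data','c_other'};
for k = 1:numel(names)
    leaf = fullfile(root, names{k}, 'x_file_to_load');
    mkdir(leaf);                              % also creates the parent folders
    v = k;
    save(fullfile(leaf,'file1.mat'), 'v');
end

% Stage 1: one shallow listing of the top level, keeping folders only.
top = dir(fullfile(root,'*_data'));
top = top([top.isdir]);

% Stage 2: one shallow listing inside each matching subfolder.
hits = cell(1,numel(top));
for k = 1:numel(top)
    hits{k} = dir(fullfile(top(k).folder, top(k).name, '*_file_to_load', 'file1.mat'));
end
files = vertcat(hits{:});                     % one struct array with all matches

rmdir(root,'s');                              % remove the throwaway tree
```

Each dir call here looks at a single known starting folder, so 'c_other' is never descended into.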

2 Comments

Perfect, that is exactly, what I wanted to know!
You'll trade some coding complexity and some thinking about the actual data structure for better performance this way. The one-time investment may well pay off in the long run if it's a case that will occur often, particularly if you can also automate the generation of the order structure programmatically.

More Answers (2)

Use contains to see if the pattern is in the folder or file name. Process the ones you want, and skip the ones you don't want by calling continue:
if contains(thisSubFolderName, 'patternIDoNotWant')
    continue % Skip to the next iteration of the for loop
end

4 Comments

@Image Analyst, thanks for the reply. Actually, if you read the question carefully, I didn't ask for a different implementation, but wanted to compare the two ways mentioned in the question and understand their fundamental differences. Anyway, I didn't quite get how your code would work. I don't have a pattern that I would like to avoid, more precisely I have too many combinations of patterns that I would like to avoid, so it is easier to use pattern that I would like to match.
To process only names that match a set of patterns, here is one way:
for i = 1:numel(subfolders)
    thisSubFolderName = subfolders(i).name;
    if contains(thisSubFolderName, 'patternIWant1') || contains(thisSubFolderName, 'patternIWant2') || contains(thisSubFolderName, 'patternIWant3')
        % Process this file
    end
end
or you could try using ismember
@Image Analyst, should it work faster than dir?
"...or you could try using ismember"
Actually, contains (and friends) work the same way...
if contains(thisSubFolderName, 'patternIWant1') || contains(thisSubFolderName, 'patternIWant2') || contains(thisSubFolderName, 'patternIWant3')
could be written as
if contains(thisSubFolderName, {'patternIWant1','patternIWant2','patternIWant3'})
You have to be careful with contains, however, and check that it is the comparison you want, because it matches any substring within the searched string.
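To illustrate the substring caveat, here is a small sketch with made-up folder names (all names are hypothetical). It compares contains with endsWith, which only accepts a match at the end of the name.

```matlab
% Hypothetical folder names, for illustration only.
names = {'run1_data','run1_data_old','data_run2'};

% contains matches the pattern anywhere in the name.
anywhere = contains(names, '_data');

% endsWith only accepts names that finish with the pattern.
atEnd = endsWith(names, '_data');

% A cell array of patterns means "match any of these".
anyOf = contains(names, {'run1','run2'});
```

With these names, contains also accepts 'run1_data_old', while endsWith keeps only 'run1_data'.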

This is an old thread at this point but I have a file exchange utility "fsfind" that is purpose-built for this application.
files = fsfind(main_folder, 'file1.mat', 'DepthwisePattern', {'.*_data','.*_file_to_load'})
The inputs support regular expressions (see documentation for "regexp") and only subfolders that match the pattern will be searched. I use it to efficiently search very deep directory structures (10+ levels).

Asked: 16 Apr 2023
Answered: 21 Apr 2025
