Sort Excel files by file content?

chlor thanks on 6 Jul 2016
Edited: Guillaume on 13 Jul 2016
Hi all, my new challenge with MATLAB involves filtering files with very inconsistent names. For example,
s =
'HI_B2_TTT9_D452_07052016.xlsx'
'HI_H2G_TTT7_D259_070516.xlsx'
'HI_B2C_TTT9_D1482_070516.xlsx'
'HI_A1C_468_070516_TTT4.xlsx'
'HI__TTT8_862_07052016_G1C.xlsx'
'HI_KA6_TTT4_148_07052016.xlsx'
'8C_HI_279_Potato_07052016.xlsx'
'HI_8C_279_Bacon_TTT52016.xlsx'
The files I want are the first six, which are all the same type of file (TTT) even though they are named in different styles. The last two files are unwanted and need to be filtered out; they can be recognized by keywords such as "Potato" and "Bacon".
My goal is to extract files whose names contain the keyword "TTT" while eliminating files whose names contain keywords such as "Potato" and "Bacon". This is not ideal: beyond this simple example there are in fact hundreds of these files in my folders, they are constantly updated, and I would need to look through them all for other potential unwanted keywords such as "Sour Cream", etc.
Ideally, I would extract the TTT files by their content, since all the TTT Excel files contain a sheet named "cooking is fun" while none of the others do. Is this feasible, and what is the best way to do it?
Thank you so much for reading, and any input will be greatly appreciated!
  5 Comments
chlor thanks on 6 Jul 2016
I am thinking that maybe I can do:
Step 1: Extract all "TTT" files
Step 2: Further eliminate any files with keyword "Bacon"
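The two steps above could be sketched with logical masks, as below. This is only an illustration on the file names from the question; the variable names and the keyword list `unwanted` are assumptions, and that list would need extending whenever a new unwanted keyword shows up.

```matlab
% File names from the question (the last two are the unwanted ones)
s = {'HI_B2_TTT9_D452_07052016.xlsx'
     'HI_H2G_TTT7_D259_070516.xlsx'
     'HI_B2C_TTT9_D1482_070516.xlsx'
     'HI_A1C_468_070516_TTT4.xlsx'
     'HI__TTT8_862_07052016_G1C.xlsx'
     'HI_KA6_TTT4_148_07052016.xlsx'
     '8C_HI_279_Potato_07052016.xlsx'
     'HI_8C_279_Bacon_TTT52016.xlsx'};

% Step 1: mark names that contain 'TTT'
isTTT = ~cellfun(@isempty, strfind(s, 'TTT'));

% Step 2: mark names that contain any unwanted keyword (assumed list)
unwanted = {'Bacon', 'Potato'};
isUnwanted = ~cellfun(@isempty, regexp(s, strjoin(unwanted, '|'), 'once'));

% Keep names that pass step 1 and are not flagged by step 2
wanted = s(isTTT & ~isUnwanted);
```

Note that 'HI_8C_279_Bacon_TTT52016.xlsx' passes step 1 because its name happens to contain "TTT", so step 2 is what removes it.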


Accepted Answer

Guillaume on 6 Jul 2016
Edited: Guillaume on 13 Jul 2016
You could certainly inspect the content of the Excel files to see whether they contain the worksheet you want. However, since that involves opening and closing every file, it is not going to be particularly fast. It's up to you whether that matters:
function hassheet = CheckExcelFilesForWorksheet(folder, filelist, sheetname)
    %CHECKEXCELFILESFORWORKSHEET Check whether or not the given excel files have a sheet with the given name
    %folder: folder where the excel files are located (1D char array / string)
    %filelist: names of excel files (cell array of 1D char arrays / string array)
    %sheetname: name of sheet to find in excel files (1D char array / string)
    %hassheet: array the same size as filelist, indicating whether or not the excel file has a sheet with sheetname (logical)
    hassheet = false(size(filelist)); %output
    excel = actxserver('Excel.Application'); %start microsoft excel
    cleanupobj = onCleanup(@() excel.Quit); %close excel when function returns or error occurs
    for fileidx = 1:numel(filelist)
        %open without updating links and as read only. Read only ensures the file can be opened even if it's already in use:
        workbook = excel.Workbooks.Open(fullfile(folder, filelist{fileidx}), false, true);
        %get the list of worksheet names by iterating over the Sheets collection:
        sheetnames = arrayfun(@(i) workbook.Sheets.Item(i).Name, 1:workbook.Sheets.Count, 'UniformOutput', false);
        workbook.Close(false); %close without saving changes
        if ismember(sheetname, sheetnames)
            hassheet(fileidx) = true;
        end
    end
end
edited 13/07/2016: fixed bugs: forgot to close the workbooks, typos
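As a possible alternative to driving Excel directly, MATLAB's `xlsfinfo` also reports the sheet names of a workbook. The helper below is a hypothetical sketch, not part of the answer above: the name `hasSheetViaXlsfinfo` is made up, it handles one file per call, and it has not been timed against the COM approach, so it may well be no faster.

```matlab
function tf = hasSheetViaXlsfinfo(filename, sheetname)
%HASSHEETVIAXLSFINFO (hypothetical helper) True if the workbook has a sheet with the given name.
%   filename:  full path of one excel file (char array)
%   sheetname: name of the sheet to look for (char array)
    % status comes back empty when xlsfinfo cannot read the file;
    % in that case sheets holds an error message instead of a cell array
    [status, sheets] = xlsfinfo(filename);
    tf = ~isempty(status) && iscell(sheets) && any(strcmp(sheets, sheetname));
end
```

It could be applied to a whole list with something like `cellfun(@(f) hasSheetViaXlsfinfo(fullfile(folder, f), 'cooking is fun'), filelist)`.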
  6 Comments
chlor thanks on 7 Jul 2016
Edited: chlor thanks on 7 Jul 2016
Ok. So I saved the function in its own file, and it now gives me the error...
Undefined function or variable 'numfiles'.
Error in CheckExcelFilesForWorksheet (line 10)
for fileidx = 1:numfiles
I am not sure why this line is invalid: I just checked, and "numfiles" is exactly the same as "numel(AllExcel)", where AllExcel is a pre-existing cell array holding the names of all the Excel files that I have.
Guillaume on 7 Jul 2016
This is a bug of your own. There's no numfiles in my original code. For reference, my original code for that line and the following was:
for fileidx = 1:numel(filelist)
    %open without updating links and as read only. Read only ensures the file can be opened even if it's already in use:
    workbook = excel.Workbooks.Open(fullfile(folder, filelist{fileidx}), false, true);


More Answers (1)

Azzi Abdelmalek on 6 Jul 2016
Edited: Azzi Abdelmalek on 6 Jul 2016
a=regexp(s,'.+TTT.+','match','once')
b=regexprep(a,'\S+Bacon\S+','')
out=s(~cellfun(@isempty,b))
  3 Comments
Guillaume on 6 Jul 2016
Certainly not! More likely, you'd get a syntax error. I'd recommend you learn the regular expression language (note that it is not specific to MATLAB).
If you want to replace Bacon or Potato, this regex would work:
regexprep(a, '\S+(?:Bacon|Potato)\S+', '')
However, this has nothing to do with the original question: "Sort Excel files by file content?"
chlor thanks on 6 Jul 2016
After you mentioned that it will require opening and closing each file, I realized that it may take forever to filter all my files every time I run the code. So I am considering doing a quick regexp pass before I attempt to open any files.
Thank you very much for all your help!
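The idea of a quick regexp pass before opening any file might look like the sketch below. The function name `filterByNameThenContent` and its arguments are hypothetical; the slow content check is passed in as a function handle so that only names surviving the regular expression are ever handed to it.

```matlab
function wanted = filterByNameThenContent(names, namePattern, contentCheck)
%FILTERBYNAMETHENCONTENT (hypothetical helper) Cheap name filter first, file check second.
%   names:        cell array of file names
%   namePattern:  regular expression a candidate name must match, e.g. 'TTT'
%   contentCheck: function handle taking a cell array of names and returning
%                 a logical array of the same size
    % Cheap step: which names match the pattern at all?
    candidate = ~cellfun(@isempty, regexp(names, namePattern, 'once'));
    keep = false(size(names));
    if any(candidate)
        % Expensive step: only the candidates are passed to the content check
        keep(candidate) = contentCheck(names(candidate));
    end
    wanted = names(keep);
end
```

With the accepted answer's function and an assumed `folder` variable, a call could be `wanted = filterByNameThenContent(names, 'TTT', @(n) CheckExcelFilesForWorksheet(folder, n, 'cooking is fun'));`. Files whose names lack "TTT" are then never opened, which only helps if every wanted file really does carry "TTT" in its name.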

