Sorting excel data (numbers and strings) and outputting sorted data to another excel file.

I have a rather large excel data file that I would like to sort. My goal is to sort the data in three different columns. Two columns are based on keywords within large sentences and one column is just numbers. There are 10 different keywords and 20 different numbers I need to sort with. I would then like the entire row of each row that meets the requirements of keywords and numbers to be placed in a separate excel file.
I am very new to MATLAB and have been struggling with this for some time now. Any help provided would be greatly appreciated.
Kind regards,

14 Comments

Can you attach a slice of you data set?
Look in the documentation for XLSREAD (type in the command line: doc xlsread), SORT, SORTROWS, and UNIQUE, and their output arguments.
I can't attach it due to confidentiality. But I will give your suggestion a try.
@connor mcgarry: you can easily create some sample data that has all of the properties of your confidential data, and upload that.
Sorry for asking this and then being inactive. I have made a similar excel sheet to show better what I am working with. I have begun trying some of the methods Cedric suggested.
In the pdf. The bold numbers and letters plus the final two columns are what I wish to search with.
Kind regards,
@connor mcgarry: Note that a PDF is not an Excel worksheet, and is almost impossible to get any meaningful data out of. A PDF does not have "all of the properties of your confidential data" as you were requested to provide. A PDF has all of the properties of a PDF: nice to look at but useless for storing data.
No one is going to sit a copy out your data from a PDF. If you want help then provide us with some data.
Data extraction heavily relies on the structure of the content and the nature of the data. We have to understand what characters we should expect, what delimiters, whether there are spaces and where, what kind of numbers are present, etc, etc, etc.
Whatever are the restrictions on your data, they don't prevent you to alter the content and share some slice. We need to know the content better; if you have
Name Age Diseases/status # treatments Dates
Dana 57 Hodgkin lymph. [32], Septicemia [27] 3 28/03/2009, 17/12/2012
change it for
Name Age Some label # label Dates
John 23 Sever fever [13], itching feet [102] 29 10/01/2001, 11/07/2010
and we can get the information that we need from there. You can change numbers, names, keywords, whatever, but there is no point for us to spend time explaining little by little general approaches and refine the answer each time you drop a little element of information.
Ok, I will begin changing the data and then I will repost. Sorry for all of these hoops. I'm very new to programming.
No problem, we don't need thousands of lines either, but provide an example that captures the possible variations. To illustrate, if a cell can contain nothing, a single word, or more words separated by commas, there is no point in giving three lines that contain more than 10 words because we don't get that there can be a single or none. In such case, pick one line with an empty cell, one with a single word, and one line with more words, so we have all the information/cases.
I have changed numbers and letters around but file attached is an example of the actual document. I need to be able to search in columns G, U and V. I hope this time it's in a helpful format. Also, I appreciate the help that has been provided so far.
Kind regards,
But see you altered the content of the column that you want to parse not just by changing values. When I see:
"Date: 2015-08-12Text: #100331835 this portion contains text and numbers that are imprtant. The dates within this column are not important. I wish to search for keywords and also numbers mixed with letters. 78034421KJ, C346E, C234A-01, and F23G-02 are some examples of numbers that I wish to search for. There is usually a large amount of information in the form of text with punctuation (ie. () or,) and a much smaller amount of information in numbers."
in a cell, I don't know it if there will be one of more dates, if they start with "Date: " including a white space, if the text is preceded by "Text: ", I don't know either if these keywords are separated by commas, etc.
Don't insert explanations in your data set. Update some values so you are fine regarding the confidentiality and don't touch the rest.
We'll get there ;-) it's just that you don't understand yet what we are using for parsing content.
Cedric, I kept as much as I could in the excel. The column U and the 2nd row include the information that you requested.
Hey Cedric, it looks like this works but when I try and apply it to the real excel doc it gives me an error message regarding the strfind command.
Error using strfind Cell input must be a cell array of strings.
Error in Roughdraft (line 15) numStrPos(:,nsId) = strfind( rawData(:,21), numStr{nsId} ).' ;
The error says it has to do with this line of code: for nsId = 1 : nNumStr numStrPos(:,nsId) = strfind( rawData(:,21), numStr{nsId} ).' ; end
I'll answer in the comments under my answer.

Sign in to comment.

 Accepted Answer

EDIT 09/27 @ 22:57 UTC
I updated your file and attached it to my answer (Data_original.xlsx). I picked on number (093321, defined as a string to keep the trailing zero) and one keyword (F25B1-00) from the content. I copied the cell that you updated to all colored rows, and I altered either the number or the keyword in rows highlighted in dark brown (which leaves rows in yellow untouched).
The following code looks for rows that have either at least one number or at least one keyword, and exports them to an output Excel document. It is just one small application for starting the discussion. See if you understand. For this, display arrays numStrPos, keywordPos, hasNumStr, hasKeyword, and at vector hasOneOfEach, and see if you understand their sizes, content, type, the logical operations &, etc.
%- Read input file.
[~,~,rawData] = xlsread( 'Data_original.xlsx' ) ;
nRows = size( rawData, 1 ) ;
% - Define arrays of numbers (defined as strings) and keywords to match.
numStr = {'093821', '093321'} ;
keywords = {'F25B1-00', 'A25D1-00', 'F25B1-99'} ;
% - Prealloc cell arrays for storing the output of find in column.
nNumStr = numel( numStr ) ;
numStrPos = cell( nRows, nNumStr ) ;
nKeyword = numel( keywords ) ;
keywordPos = cell( nRows, nKeyword ) ;
% - Iterate through elements of numStr and find in all data rows, and store.
% in column.
for nsId = 1 : nNumStr
numStrPos(:,nsId) = strfind( rawData(:,21), numStr{nsId} ).' ;
end
% - Iterate through elements of keywords and find in all data rows.
for kId = 1 : nKeyword
keywordPos(:,kId) = strfind( rawData(:,21), keywords{kId} ).' ;
end
% - Convert to arrays of logicals flagging "found at least one", which
% translates into "check not empty".
hasNumStr = ~cellfun( @isempty, numStrPos ) ;
hasKeyword = ~cellfun( @isempty, keywordPos ) ;
% - Create a column vector of logicals that flag rows that have e.g. at
% least one number AND one keyword.
hasOneOfEach = any( hasNumStr, 2 ) & any( hasKeyword, 2 ) ;
% - Use this vector to pick relevant rows of rawData for export.
xlswrite( 'Data_selection.xlsx', rawData(hasOneOfEach,:) ) ;
===[ FORMER ANSWER, READ IT FOR TRAINING ] =============================
Here is an example if you cannot attach a slice of data.
Say you load some Excel file content
[~,~,raw] = xlsread( 'MyData.xlsx', 'RelevantSheetName' ) ;
where the 3rd output of XLSREAD is raw data (contains both string and numeric data), and you get the following cell array:
raw =
4×3 cell array
'A,C' 'X' [20]
'B,C,E,F' 'X,Y,Z' [35]
'F,A' 'X,Z' [10]
'C,B,A' 'Y' [30]
where the letters represent distinct keywords. Now you want to extract rows that have at least keyword A in column 1, at least keyword X in column 2, and then sort them by number in column 3.
Proceed by little steps. First, are we able to spot rows that have A in column 1? There is string comparison/finding and/or pattern matching. If we google a little, we find MATLAB functions STRCMP, STRCMPI, STRFIND, REGEXP, REGEXPI, etc. For strict comparison, we would have to split using ', ' as a separator and probably iterate through rows for doing that .. maybe STRFIND is easier to use. Can we use it on a single cell content?
>> strfind( raw{1,1}, 'A' )
ans =
1
>> strfind( raw{1,1}, 'C' )
ans =
3
>> strfind( raw{1,1}, 'F' )
ans =
[]
apparently it outputs the position of the match and an empty array otherwise. Can STRFIND operate on a cell array (so we don't have to iterate)? Let's test passing the whole first column of raw:
>> strfind( raw(:,1), 'A' )
ans =
4×1 cell array
[1]
[]
[3]
[5]
apparently we can. Now it would be great if we could use this output to index relevant rows: rows for which there was a match. There are several way to index arrays, e.g. by position/index or using vectors of logicals (true/false) that flag relevant positions. Are we able to generate one by hand?
>> lIndex = [true, false, true, true] ; % or lIndex = logical( [1, 0, 1, 1] ) ;
>> class( lIndex )
ans =
logical
Are we able to use it to extract relevant rows of raw, keeping all columns?
>> raw(lIndex,:)
ans =
3×3 cell array
'A,C' 'X' [20]
'F,A' 'X,Z' [10]
'C,B,A' 'Y' [30]
So now are we able to build such a vector of logicals using the output of STRFIND? First, are we able to test if a single element is empty?
>> matchCol1 = strfind( raw(:,1), 'A' )
matchCol1 =
4×1 cell array
[1]
[]
[3]
[5]
>> isempty( matchCol1{1} )
ans =
logical
0
>> isempty( matchCol1{2} )
ans =
logical
1
Notice that the output of ISEMPTY is a logical. So are we able to apply ISEMPTY to all elements of |matchCol1? Googling would lead us to CELLFUN:
>> lIndexCol1 = ~cellfun( @isempty, matchCol1 )
lIndexCol1 =
4×1 logical array
1
0
1
1
where the ~ is the logical NOT (as the output of CELLFUN is [0;1;0;0]). Putting everything together, we can write:
>> lIndexCol1 = ~cellfun( @isempty, strfind( raw(:,1), 'A' ))
lIndexCol1 =
4×1 logical array
1
0
1
1
>> lIndexCol2 = ~cellfun( @isempty, strfind( raw(:,2), 'X' ))
lIndexCol2 =
4×1 logical array
1
1
1
0
Now we want rows that satisfy conditions on both columns:
>> lIndex = lIndexCol1 & lIndexCol2
lIndex =
4×1 logical array
1
0
1
0
that we can use to extract relevant rows:
>> selected = raw(lIndex,:)
selected =
2×3 cell array
'A,C' 'X' [20]
'F,A' 'X,Z' [10]
Then we need to sort based on column 3. Again, there are several options. Looking at SORT we see that it has a second argument that we could use to re-order selected based on the sorting order of selected(:,3), but SORTROWS may do that in one shot:
>> sorted = sortrows( selected, 3 )
sorted =
2×3 cell array
'F,A' 'X,Z' [10]
'A,C' 'X' [20]
which can be exported to an Excel file using XLSWRITE.
Chance is that the solution to your problem will involve a little bit of all these concepts.

8 Comments

So what have you tried so far and where did you hit a difficulty? Also, I understand the confidentiality issue, but you can easily keep e.g. the first 3-5 rows and update some words and numbers so we cannot identify what you are doing. This would greatly help because then we could work on the real structure and data format.
Thank you for updating your file. See my edited answer.
Connor:
Error in Roughdraft (line 15)
numStrPos(:,nsId) = strfind( rawData(:,21), numStr{nsId} ).' ;
Now you have to debug and understand how the new file is different from your test file. The first thing would be to check that column 21 is the correct column: after it crashes, type
rawData(:,21)
and see if it is correct. If not, adjust the 21.
I bet that at least one cell content is not a string (type/class char). You can check this as follows:
isNotChar = ~cellfun( @ischar, rawData(:,21) )
and look at the output. If it is too long, FIND row IDs that are not char and check in the input Excel file if they are e.g. empty:
rowIds = find( isNotChar )
Then maybe you will realize that some are empty in the Excel document, and that they are numeric NaNs in the imported cell array. You can check this as follows:
rawData(rowIds,21)
If it is the case, you can replace these entries by empty strings for example, before processing rawData(:,21) :
rawData(isNotChar,21) = {''} ;
I figured it out. I had to clear my workspace (rookie mistake). I appreciate all of your help. Thank you so much!
Hey Cedric, may I include your name in a paper I am writing? I would like to cite you and give credit where it is due.

Sign in to comment.

More Answers (0)

Categories

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!