More efficient way to return a string's position in a cell string array

dpb
dpb on 8 Oct 2014
Commented: dpb on 8 Oct 2014
Given an array such as
>> whos endtxt
Name Size Bytes Class Attributes
endtxt 137x2 22466 cell
Surely there is a (much) less verbose way to return the location of a specific string within the array than
>> find(~cellfun('isempty',strfind(endtxt(:,2),'FULLER')))
ans =
46
>>
I've whiffed on an efficient way to do something useful with the cell array of a zillion empty cells excepting for the one(s) of interest...the above does (finally!) work, but surely????
  2 Comments
dpb
dpb on 8 Oct 2014
The issue is syntax, Azzi...
strfind in the above returns
>> whos ix
Name Size Bytes Class Attributes
ix 137x1 8228 cell
Where
>> ix
ix =
[]
[]
...
[]
[]
[1]
[]
...
[]
[]
[]
>>
that is, a cell array with a zillion empty cells and the one (in this case) of interest. What I'm looking for is the index of the subject string, so that I can use it to index into another data array for the given donor account. The value I need to return is therefore the position of that one cell (or the positions, if more than one) in the array.
I'm almost positive it's just an addressing trick or some other obvious manipulation of cell array syntax that I'm missing (or perhaps there's a much better way to do the search?). Before finally beating the above into the value I was looking for, my attempts included
>> find(ix)
Undefined function 'find' for input arguments of type 'cell'.
>> find(ix{:})
Error using find
Too many input arguments.
>> find([ix{:}])
ans =
1
>>
The behavior of find on a cell array is disappointing; I'm not sure whether it has changed since R2012b, which is what I'm using. The other problem is that the comma-separated list collapses to the value in the one non-empty cell rather than an array with one entry per cell (which makes sense, and is useful in many other cases, but not for the present purpose).
Hopefully that is sufficient background...
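To make the failure mode above concrete, here is a minimal sketch on a three-row stand-in for endtxt(:,2). The variable names and the sample data are assumptions made up for illustration; only the function calls come from the thread. It shows why find([ix{:}]) reports the character offset rather than the row, and why testing each cell for emptiness keeps the row.

```matlab
% Hypothetical sample data standing in for endtxt(:,2)
names = {'CLASS OF 57'; 'FULLER, BILL W. & BEATRICE F.'; 'ELKS SCHOLARSHIP'};

% One cell per row: [] where there is no match, the character offset otherwise
ix = strfind(names, 'FULLER');

% Concatenating the cell contents drops the empty cells, so only the
% character offset of the match survives and the row number is lost
offsets = [ix{:}];

% Testing each cell for emptiness keeps one logical value per row,
% so find returns the row of the match
hit = ~cellfun('isempty', ix);
row = find(hit);
```

In this sample, offsets holds the offset of 'FULLER' within its string while row holds the position of that string in the list, which is the value the question is after. In releases from R2016b onward, find(contains(names,'FULLER')) should give the same row more directly, but that function is not available in R2012b.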


Answers (3)

Guillaume
Guillaume on 8 Oct 2014
Assuming that the string you're looking for ('FULLER') is an exact match for one of the strings in the cell array (and not just a substring), then
find(ismember(endtxt(:, 2), 'FULLER'))
  1 Comment
dpb
dpb on 8 Oct 2014
It is NOT an exact match, unfortunately. Sorry, should've mentioned; that's why strfind since it locates the substring within the overall string.
I'm searching for a donor in a list where the name in the list is a descriptive one that is not normalized (in a database sense). The actual string in this case is
FULLER, BILL W. & BEATRICE F.
There are other examples such as
CLASS OF 57
LASTNAME, VIRGINIA
LASTNAME, VIRGINIA & BILL
CUP (First Lastname & Spousename)
LASTNAME, ANTHONY MEM
NAME SCHOLARSHIP
LASTNAME,FIRSTNAME
...
LASTNAME FAMIY
ELKS SCHOLARSHIP
EPWORTH/AH SCH
...
Thus, not only must one know a fair amount about the data, but what one has to search for is pretty specific. I'm working towards correcting these issues, but that's a longer-range project than the present short-term goal.
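The difference between an exact match and a substring match can be sketched on a few of the names listed above. The variable names are hypothetical, and the prefix test assumes the search term sits at the start of the entry, as it does for the FULLER record; it would not find a term in the middle of a description.

```matlab
% Hypothetical sample built from entries quoted in this thread
names = {'CLASS OF 57'; 'FULLER, BILL W. & BEATRICE F.'; 'ELKS SCHOLARSHIP'};

% Exact match: no entry is equal to 'FULLER', so nothing is found
exactRow = find(ismember(names, 'FULLER'));

% Prefix match: compares only the first 6 characters of each entry
prefixRow = find(strncmp(names, 'FULLER', 6));

% Substring match anywhere in the entry, stopping at the first hit per entry
anyRow = find(~cellfun('isempty', regexp(names, 'FULLER', 'once')));
```

The strncmp form returns a logical array directly, so it avoids the cell array of mostly empty results, at the cost of only working for terms at the start of the string.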



matt dash
matt dash on 8 Oct 2014
If I'm understanding correctly:
[junk,answer] = ismember('FULLER',endtxt(:,2))
  2 Comments
dpb
dpb on 8 Oct 2014
That surely seems a lot of rigamarole to handle the resulting cell array, though, Matt -- I thought there must be some syntax "trick" I'm not seeing since I rarely use cell arrays and this type of database stuff isn't a general thing I do, either; I almost exclusively just deal with numerics.
It does work, granted, so guess I'll just go on for now (actually, I already had, of course, just taking a break).



matt dash
matt dash on 8 Oct 2014
Well, here is an option with even more rigamarole, but it is faster if that matters. Basically, make a copy of your text that is not a cell array, and cross-reference it with a vector indicating where the row breaks occur. For potentially even more speed, you could remove the find entirely by keeping a vector of row indices for every character in the text (if memory is not an issue).
1) Use cellfun(@length,<cells>) to get the length of each cell, then cumsum this (prepending a 0) to get the offset at which each line starts.
2) Convert the cell array to one long string with [<cells>{:}].
3) Use strfind on this one string to get the index.
4) Cross-reference this with the vector from (1) to see which line the match begins in.
Ridiculous, but on my computer it is 10-30x faster than find(~cellfun('isempty',strfind(lines,teststr)))
and seems to get faster for larger amounts of text.
code:
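fid = fopen('book.txt','r');   % some long text file
teststr = 'Lampsacus';         % some word in it
% read the text file line by line:
tline = fgetl(fid);
lines = {};
while ischar(tline)
    lines{end+1} = tline;
    tline = fgetl(fid);
end
fclose(fid);
% method 1: search one concatenated string, then map offsets back to lines
q = cellfun(@length,lines);
starts = [0 cumsum(q)];        % starts(k) = number of characters before line k
alltxt = [lines{:}];
tic
a = strfind(alltxt,teststr);
idx = zeros(size(a));          % also defines idx when nothing is found
for i = 1:numel(a)
    % strict <: a match starting on the last character of line k has
    % a(i) == starts(k+1) and must stay on line k, not k+1
    idx(i) = find(starts<a(i),1,'last');
end
toc
% method 2: search each cell, then test for non-empty results
tic
find(~cellfun('isempty',strfind(lines,teststr)));
toc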
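The offset-to-line mapping and the per-character lookup vector mentioned above can be checked on a tiny example. This is a sketch under stated assumptions: the sample strings and the names txtlines, rowOfChar, rowsFind, rowsTable and rowsCell are made up for illustration, and the comparison against starts is strict so that a match on a one-character entry is credited to that entry rather than the next one.

```matlab
% Hypothetical sample; the middle entry is one character long to
% exercise the boundary between entries
txtlines = {'alpha', 'b', 'gamma beta'};
teststr = 'b';

q = cellfun(@length, txtlines);     % length of each entry
starts = [0 cumsum(q)];             % starts(k) = characters before entry k
alltxt = [txtlines{:}];             % one long character array
a = strfind(alltxt, teststr);       % offsets of every match in alltxt

% Variant 1: last entry whose preceding character count is below the offset
rowsFind = arrayfun(@(p) find(starts < p, 1, 'last'), a);

% Variant 2: precompute the entry number of every character, then index
rowOfChar = zeros(1, numel(alltxt));
for k = 1:numel(txtlines)
    rowOfChar(starts(k)+1 : starts(k+1)) = k;
end
rowsTable = rowOfChar(a);

% Reference: the per-cell search from the question
rowsCell = find(~cellfun('isempty', strfind(txtlines, teststr)));
```

Two caveats apply to the concatenated search: it lists an entry once per match, so unique may be needed when a term occurs several times in one entry, and it can report a match that straddles two adjacent entries, which the per-cell search never does.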
  1 Comment
dpb
dpb on 8 Oct 2014
Interesting, Matt... I think I'm not terribly surprised that a character-array search followed by find() on a numeric array is faster than the cell-array string search followed by a cellfun() call, although your timings of 10x and greater are disappointing for cellstrings.
Fortunately, this is a pretty small database, and the lookup is something I only need to do interactively, so performance itself isn't really an issue. As noted, it just seemed like there should be a better way to get the position out of the returned cell array.

