More efficient returning string position in cell string array

Question

dpb on 8 Oct 2014

https://uk.mathworks.com/matlabcentral/answers/157887-more-efficient-returning-string-position-in-cell-string-array

Given an array such as

>> whos endtxt
Name          Size            Bytes  Class    Attributes
endtxt      137x2             22466  cell

Surely there is a (much) less verbose way to return the location of a specific string within the array than

>> find(~cellfun('isempty',strfind(endtxt(:,2),'FULLER')))
ans =
  46
>>

I've whiffed on an efficient way to do something useful with the cell array of a zillion empty cells excepting for the one(s) of interest...the above does (finally!) work, but surely????

2 Comments

Azzi Abdelmalek on 8 Oct 2014

Can you explain your problem with an example?

dpb on 8 Oct 2014

The issue is syntax, Azzi...

strfind in the above returns

>> whos ix
Name        Size            Bytes  Class    Attributes
ix        137x1              8228  cell

Where

>> ix
ix = 
  []
  []
...
  []
  []
  [1]
  []
...
  []
  []
  []
>>

That is, ix is a cell array with a zillion empty cells and (in this case) just one of interest. What I'm looking for is the index of the subject string, so that I can use it to index into another data array for the given donor account; the value I need to return is the position of that one (or the positions, if more than one) in the array.

I'm almost positive it's just an addressing or some other obvious manipulation of cell array syntax I'm missing (or perhaps there's a much better way to do the search?) but where I was before finally beating the above into the value I was looking for included

>> find(ix)
Undefined function 'find' for input arguments of type 'cell'. 
>> find(ix{:})
Error using find
Too many input arguments. 
>> find([ix{:}])
ans =
   1
>>

The above with find is disappointing; I'm not sure whether it has been resolved since R2012b, which is what I'm using. The other problem is that the comma-separated list collapses to just the value in the one non-empty cell rather than an array the size of ix (that makes sense, and is useful in many other cases, but not for the present purpose).
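To make the collapse concrete, here is a minimal sketch using a small hypothetical stand-in for ix (five cells, one hit in cell 4; the variable names are assumptions for illustration only). It contrasts concatenating the comma-separated list with testing each cell for emptiness:

```matlab
% Hypothetical stand-in for the strfind output: five cells, one hit in cell 4
ix = {[]; []; []; 1; []};

% Concatenating the comma-separated list drops the empty cells entirely
flat = [ix{:}];                    % flat is just 1
pos  = find(flat);                 % 1: an index into flat, not into ix

% Testing each cell keeps one logical value per original cell
hit = ~cellfun('isempty', ix);     % 5x1 logical, true only in row 4
row = find(hit);                   % 4: an index into ix itself
```

The second form is what the one-liner in the question does; the position survives because the empties are converted to false rather than discarded.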

Hopefully that is sufficient background...

Answer 1

Guillaume on 8 Oct 2014

Assuming that the string you're looking for ('FULLER') is an exact match for one of the strings in the cell array (and not just a substring), then

find(ismember(endtxt(:, 2), 'FULLER'))

1 Comment

dpb on 8 Oct 2014

It is NOT an exact match, unfortunately. Sorry, should've mentioned; that's why I used strfind, since it locates the substring within the overall string.

I'm searching for a donor in a list where the name in the list is a descriptive one that is not normalized (in a database sense). The actual string in this case is

FULLER, BILL W. & BEATRICE F.

There are other examples such as

CLASS OF 57
LASTNAME, VIRGINIA
LASTNAME, VIRGINIA & BILL
CUP (First Lastname & Spousename)
LASTNAME, ANTHONY MEM
NAME SCHOLARSHIP
LASTNAME,FIRSTNAME
...
LASTNAME FAMIY
ELKS SCHOLARSHIP
EPWORTH/AH SCH
...

Thus, not only must one know a fair amount about the list, what one has to search for is pretty specific. I'm working towards correcting these issues, but that's a longer-range project than the present short-term goal.
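As a rough illustration of why the exact-match approach comes up empty here, the sketch below uses a three-entry hypothetical list built from the sample names above (names is an assumed stand-in for endtxt(:,2)):

```matlab
% Hypothetical three-row sample standing in for endtxt(:,2)
names = {'CLASS OF 57'; 'FULLER, BILL W. & BEATRICE F.'; 'ELKS SCHOLARSHIP'};

% ismember compares whole strings, so a bare surname matches nothing
exact = find(ismember(names, 'FULLER'));                      % empty

% strfind looks inside each string, so the substring is located
sub = find(~cellfun('isempty', strfind(names, 'FULLER')));    % 2
```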

Answer 2

matt dash on 8 Oct 2014

If I'm understanding correctly:

[junk,answer] = ismember('FULLER',endtxt(:,2))

2 Comments

matt dash on 8 Oct 2014

Oh. For inexact matching I think your solution above is as good as it gets.

dpb on 8 Oct 2014

That surely seems a lot of rigamarole to handle the resulting cell array, though, Matt -- I thought there must be some syntax "trick" I'm not seeing since I rarely use cell arrays and this type of database stuff isn't a general thing I do, either; I almost exclusively just deal with numerics.

It does work, granted, so guess I'll just go on for now (actually, I already had, of course, just taking a break).
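If the verbosity is the main complaint, one option is to hide the idiom behind an anonymous function. This is only a sketch: locate is an illustrative name (not a built-in), and names is a hypothetical stand-in for endtxt(:,2).

```matlab
% Hypothetical sample standing in for endtxt(:,2)
names = {'CLASS OF 57'; 'FULLER, BILL W. & BEATRICE F.'; 'ELKS SCHOLARSHIP'};

% locate is an illustrative helper name, not a MATLAB built-in;
% it wraps the same strfind / isempty / find chain from the question
locate = @(c, pat) find(~cellfun('isempty', strfind(c, pat)));

row = locate(names, 'FULLER');     % 2
```

The work done is unchanged; only the call site gets shorter. On R2016b or later, find(contains(names,'FULLER')) should give the same result, but contains is not available in R2012b.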

Answer 3

matt dash on 8 Oct 2014

Well, here is an option with even more rigamarole, but it is faster if that matters. Basically, make a copy of your text that is not a cell array, and cross-reference it with a vector indicating where row breaks occur. For potentially even more speed you could remove the find entirely by keeping a vector of row indices for every character in the text (if memory is not an issue).

1) Use cellfun(@length,<cells>) to get the length of each cell, then cumsum this to get the start index of each line (prepend a 0 at the beginning).
2) Convert the cell array to one long string with [<cells>{:}].
3) Now just use strfind on this one string to get the index.
4) Cross-reference this with the index vector from (1) to see which line it begins in.

Ridiculous, but on my computer it is 10-30x faster than find(~cellfun('isempty',strfind(lines,teststr)))

and seems to get faster for larger amounts of text.

code:

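fid = fopen('book.txt','r');   % some long text file
teststr = 'Lampsacus';         % some word in it
% read the text file line by line into a cell array
lines = {};
tline = fgetl(fid);
while ischar(tline)
  lines{end+1} = tline;
  tline = fgetl(fid);
end
fclose(fid);
% method 1: search one long string, then map each hit back to its line
q = cellfun(@length,lines);    % length of each line
starts = [0 cumsum(q)];        % starts(k) = number of characters before line k
alltxt = [lines{:}];           % no separators, so a hit could straddle two lines
tic
a = strfind(alltxt,teststr);
idx = zeros(size(a));
for i = 1:numel(a)
  % strict < so a hit starting on the last character of a line maps to that line
  idx(i) = find(starts<a(i),1,'last');
end
toc
% method 2: search each cell, then locate the non-empty results
tic
idx2 = find(~cellfun('isempty',strfind(lines,teststr)));
toc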
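The "vector of row indices for every character" variant mentioned above could look roughly like this. It is a sketch on a small hypothetical sample rather than the text file, and rowOf is an assumed name:

```matlab
% Hypothetical sample text; the second cell is deliberately empty
lines = {'alpha beta', '', 'gamma FULLER delta', 'FULLER'};
len   = cellfun('length', lines);

% rowOf(p) = index of the cell that character p of the joined text came from
rowOf = zeros(1, sum(len));
pos = 0;
for k = 1:numel(lines)
  rowOf(pos+1 : pos+len(k)) = k;   % no-op for empty cells
  pos = pos + len(k);
end

alltxt = [lines{:}];                        % one long string, no separators
idx = rowOf(strfind(alltxt, 'FULLER'));     % [3 4]
```

This trades one stored number per character for a plain lookup in place of the find loop. As with method 1, the cells are joined without separators, so a pattern that happens to span the end of one cell and the start of the next would be reported as a hit.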

1 Comment

dpb on 8 Oct 2014

Interesting, Matt... I think I'm not terribly surprised that a character array search and find() on an array is faster than the cell array string search followed by a cellfun() call, although your timings of 10X and greater are disappointing for cellstrings.

Fortunately, this is a pretty small database and it isn't something I need to do except interactively, so performance itself isn't really an issue. As noted, it just seemed like there should be a better way to get the position out of the returned cell array.
