Efficient identification of quoted substrings in a string
I'm looking for help from the MATLAB string-parsing experts out there to help me come up with a computationally efficient way (perhaps using regular expressions) to identify the quoted parts of a string from random sources of text (e.g. a journal article). The method needs to work regardless of whether the quoted substrings are enclosed in single or double quotes. Further, the text may contain apostrophes either inside or outside of the quoted substrings.
For example, in this sentence:
Sally said "It's a wonderful life" when she heard Molly's sister proclaim "It's a great day".
I would like to identify "It's a wonderful life" and "It's a great day", while in this text:
The attributes of the <table> tag were 'width=80%' and 'align="center"'.
I would like to identify 'width=80%' and 'align="center"'. [Note: I purposely did not show the above example sentences in MATLAB code, but rather just showed them as free text, so as not to confuse my question with how to properly capture such sentences in a MATLAB variable.]
I recognize these examples are a bit pedantic, but since the code won't be able to control the source of the text it is searching, it needs to be robust across these cases.
I have been able to do this with a "brute force" linear search through the text, but it's pretty inefficient and complex. I am not enough of a regexp expert to figure out a way to do this with regular expressions, but I've seen such experts come up with pretty elegant and efficient solutions to such problems. Hence, I was hoping my case might be tantalizing to one of those experts in this community. Thanks for any suggestions.
3 Comments
John D'Errico
on 13 Feb 2021
Good luck with this, as I think you will find it a difficult thing to do robustly. You can probably create texts with algorithmically nasty things in them, contained in quotes. Remember there are different kinds of quotes.
aaaaaaaaa"bbbbb'cccccccc"ddd'eeeeee
Is the quoted string you want to find there: "bbbbb'cccccccc", or is it 'cccccccc"ddd'? In either case, it seems there is a spare character, used for some other purpose, and a mark inside the string.
I'd suggest if you have code that does as you like, then use it and don't worry, unless you find it to be a problem in terms of time.
If you do find it to be a problem, it is relatively easy to search for the locations of all quote symbols and then decide what is intended to be inside the quotes using some scheme.
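As a rough illustration of that idea, here is a minimal sketch, not a robust solution. The pairing scheme is my own assumption: a quote symbol may open a quotation only if it is not preceded by a letter or digit, and may close one only if it is not followed by a letter or digit, so apostrophes inside words such as It's or Molly's are ignored. Only the straight characters " and ' are handled, and the variable names are purely illustrative.

```matlab
% Sketch only: locate all quote symbols, then pair them with a simple
% word-boundary heuristic (an assumption, not a general solution).
txt = 'Sally said "It''s a wonderful life" when she heard Molly''s sister proclaim "It''s a great day".';

isQuote   = (txt == '"') | (txt == '''');   % locations of all quote symbols
isAlnum   = isstrprop(txt, 'alphanum');     % letters and digits
prevAlnum = [false, isAlnum(1:end-1)];      % is the previous character a letter or digit?
nextAlnum = [isAlnum(2:end), false];        % is the next character a letter or digit?

canOpen  = isQuote & ~prevAlnum;            % e.g. the " in:  said "It's
canClose = isQuote & ~nextAlnum;            % e.g. the " in:  life" when

quoted  = {};                               % quoted substrings, delimiters included
lastEnd = 0;                                % last index already inside a match
for k = find(canOpen)
    if k <= lastEnd
        continue                            % skip quotes nested in an earlier match
    end
    % First later quote of the same kind that is allowed to close
    j = k + find(canClose(k+1:end) & txt(k+1:end) == txt(k), 1);
    if ~isempty(j)
        quoted{end+1} = txt(k:j); %#ok<AGROW>
        lastEnd = j;
    end
end

disp(quoted')
```

For the sentence above this picks out "It's a wonderful life" and "It's a great day". Under the same rules the <table> example gives 'width=80%' and 'align="center"', while the aaaaaaaaa"bbbbb'cccccccc"ddd'eeeeee string returns nothing, because every quote symbol there touches a letter on both sides. Expect it to fail on things like a leading apostrophe ('tis), a plural possessive (the dogs' bowl) that follows an opening single quote, or curly quotes.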
Peter
on 13 Feb 2021
Peter
on 14 Feb 2021
Accepted Answer