Extract only text between quotes of a string

Folks, could use an assist. New to REGEXP but persistent. Desire only the literal text between quotes for this string:
imt = "e";
[subchunk] = regexp(textline,'\".*?\"','match','ignorecase');
imt = subchunk;
When I return the argument from my readtext file function, and print using disp(imt), I get this:
'"e"'
When I only want:
e
I assume the single quotes are associated with the disp() function?

 Accepted Answer

subchunk = regexp(textline, '(?<=")[^"]+(?=")', 'match');
imt = subchunk{1};
You could also consider just using a basic strfind() for '"', removing everything up to the first match and everything from the second match on.
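For comparison, the strfind() route might look like the sketch below. It assumes textline holds the sample line from the question and that only the text between the first two double-quotes is wanted; returning an empty result when fewer than two quotes are found is an arbitrary choice for illustration.

```matlab
textline = 'imt = "e";';              % sample line taken from the question
q = strfind(textline, '"');           % indices of every double-quote character
if numel(q) >= 2
    % keep only the characters strictly between the first two quotes
    imt = textline(q(1)+1 : q(2)-1);
else
    imt = '';                         % assumption: no quoted text means "nothing found"
end
disp(imt)
```

Because imt is built by plain indexing, it is already a character vector and needs no {1}.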

8 Comments

Yeah, I've seen others mention use of strfind(). Still, I would like to become more proficient using REGEXP. The rest of my code uses REGEXP for splitting, etc., so it would be nice to stay consistent.
Does anyone have a suggestion to offer? I would like to use a REGEXP to solve this question.
The regexp() I already gave is the solution.
Apologies, multi-tasking for the last several days, thought you simply pasted my code for reference.
So, your regex says:
Form a grouping, looking for any number of characters that match double quotes, but not including double quotes. Then look for any character(s), not double quotes, ... and I have trouble following the logic after that.
I note you also changed the way I handled the array. For my benefit and others, can you explain the differences?
%before
[subchunk] = regexp(textline,'\".*?\"','match','ignorecase');
imt = subchunk;
%after
subchunk = regexp(textline, '(?<=")[^"]+(?=")', 'match');
imt = subchunk{1};
The {1} is to account for your "I assume the single quotes are associated with the disp() function?"
The ignorecase option is not needed because you have no case-sensitive characters in the pattern that you are matching against.
[^"]+ means to look for one or more characters that are not double-quotes (extending as far as possible). That is the first operation logically executed. Then, the (?<=") before that says that the potential match just located is not to be considered (and another search is to be done) unless there was a double-quote immediately before the stretch of non-double-quotes. Then the (?=") after says that the potential match just located is not to be considered (and another search is to be done) unless there is a double-quote immediately after the stretch of non-double-quotes. The look-behind and look-ahead at the double-quotes do not extend the match at all: the match will not include the double-quotes.
Another way of thinking about the pattern is, "Look for a double-quote, but do not include it in the match. Then look for the longest stretch of non-double-quote characters after that, with at least one character, and those non-double-quotes are to be included in the match. Then look right after the potential match and ensure there is a double-quote after it, but do not include that double-quote in the match."
You can _almost_ simplify the expression to '(?<=")[^"]+' but the difference between that and what I wrote is that what I wrote must have a trailing double-quote whereas the shorter version would be allowed to end at the end of the string even if no double-quote had been found.
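A small sketch of that difference, assuming a made-up malformed line (unclosed) that has an opening double-quote but no closing one. The variable names are only for illustration.

```matlab
closed   = 'imt = "e";';     % line from the question
unclosed = 'imt = "e;';      % hypothetical malformed line with no closing quote

withAhead    = '(?<=")[^"]+(?=")';   % pattern from the answer
withoutAhead = '(?<=")[^"]+';        % shortened pattern

m1 = regexp(closed,   withAhead,    'match');  % finds the quoted text
m2 = regexp(unclosed, withAhead,    'match');  % no closing quote, so nothing matches
m3 = regexp(unclosed, withoutAhead, 'match');  % run continues to the end of the text

disp(m1{1})
disp(isempty(m2))
disp(m3{1})
```

Note that on the well-formed line the shortened pattern would also be expected to return the text after the closing quote as a second match, since that text is preceded by a double-quote too.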
Walter, your regexp works perfectly! For clarity (and for others...);
Why did you not have to repeat the first () search expression as follows:
(?<=")[^"]+(?<=")
Would the second group, (?="), not have included the trailing double-quote in the match (obviously that is not the case)?
The return argument is shown using disp()= e (no quotes). The expression,
imt = subchunk;
would still work for purposes of storing the value and returning to main program to be used say for a case statement check? My understanding is the curly braces permit index referencing. But subchunk is all one string right?
(?<=") is always "look behind" (from where you are), so
(?<=")[^"]+(?<=") would mean to look behind for a double-quote, match a bunch of non-double-quote stuff, and then look behind from between the last non-double-quote and the next character (or end of string) to see if the previous character was a double-quote. It could not be, because the run just matched contains only non-double-quotes.
Look-behind from where you are, look ahead from where you are, different operators.
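To see the two operators side by side, here is a minimal sketch run against the line from the question. The variable names ahead and behind are only for illustration.

```matlab
textline = 'imt = "e";';

% look-behind before the run, look-ahead after it (the pattern from the answer)
ahead  = regexp(textline, '(?<=")[^"]+(?=")',  'match');

% look-behind on both sides: the second test inspects the last character of
% the run itself, which by construction is never a double-quote
behind = regexp(textline, '(?<=")[^"]+(?<=")', 'match');

disp(ahead{1})
disp(isempty(behind))
```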
With the 'match' option (and without 'once'), the output from regexp() is a cell array, even when only one thing is being returned. You can return that cell array, and that might be appropriate in some cases, but be sure you do not try to switch() on the cell array itself: switch on the _content_ of the cell array. The
imt = subchunk{1};
strips away the cell array layer, leaving imt as a plain string (which _can_ be switch()'d on.)
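A short sketch of that distinction, using the line from the question. The case labels and the result variable are invented for illustration, and the isempty() guard is an added assumption about how a line without quoted text should be handled.

```matlab
textline = 'imt = "e";';
subchunk = regexp(textline, '(?<=")[^"]+(?=")', 'match');

if isempty(subchunk)
    imt = '';                 % assumption: no quoted text found
else
    imt = subchunk{1};        % unwrap the cell: imt is now a char vector
end

disp(class(subchunk))         % the container returned by regexp
disp(class(imt))              % the content pulled out with {1}

switch imt                    % switch on the content, not on the cell array
    case 'e'
        result = 'matched e';
    otherwise
        result = 'no match';
end
disp(result)
```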
Awesome. Thanks for your help.


More Answers (1)

From R2020b, you can use pattern, which is often easier to read than a complicated regexp expression.
For your problem, the extractBetween function seems the best fit:
subchunk = extractBetween("'e'", "'", "'")
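Applied to the double-quoted line from the original question, the same idea might look like the sketch below. It assumes a char-vector input; the char() call is added here so that imt ends up as a plain character vector whether extractBetween hands back a string or a cell array.

```matlab
textline = 'imt = "e";';                         % line from the question
subchunk = extractBetween(textline, '"', '"');   % text between the two double-quotes
imt = char(subchunk);                            % normalise to a plain char vector
disp(imt)
```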
