How can I specify position and exclude repeated characters using regexp?

Erick Alejandro on 3 Nov 2015
Edited: Walter Roberson on 6 Nov 2015
In searching gene sequence data, I want to find sequences that have the form NGGGNGGGN, where N is any combination of A, C, T, or G, in any order, of length 1-7. However, I do not want to find N with repeated G; for example, I don't want N = GG, AGGA, or AGGGA. N may include G, but it must not contain consecutive G's such as GG, and G must not be the first or last character of N, because that would extend the adjacent GGG.
I want to use something like expr = 'G{3}[ACTG]{1-7}(?!GG)G{3}' but MATLAB does not like this. I'm not very good with conditions in regexp, or regexp in general. Any help is appreciated.

Answers (1)

Nitin Khola on 5 Nov 2015
Edited: Walter Roberson on 6 Nov 2015
Thanks for providing a detailed question.
From what I understand, the only repeated pattern of G's you want is "GGG" at its expected positions in the form; any other occurrence of that pattern is unwanted. So I thought we could use "strfind" http://www.mathworks.com/help/matlab/ref/strfind.html to look for the pattern "GGG". As the documentation describes, "strfind" returns the starting index of every occurrence of the pattern it searches for. These indices help eliminate sequences that have, for example, "AGGGA" in N.
The idea is simple. First, run "strfind" on each string to locate the indices of the "GGG" pattern. Second, eliminate sequences in which "GGG" starts at an index that is not allowed. For example, if the length of N is 6, only the indices 7 and 16 are valid.
You can even come up with a formula for the valid indices, assuming all three N segments have the same length: length of N = (total sequence length - 6)/3, and the valid indices for the "GGG" pattern are (length of N + 1) and (2*length of N + 3 + 1). I apologize in advance if I have made any arithmetic errors in the example formula above.
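The index check described above could be sketched roughly as follows. The example sequence and the variable names (seq, nLen, validIdx, isValid, noExtraGG) are made up for illustration, and the sketch assumes all three N segments have the same length. The last check on "GG" is an extra step that is not part of the formula above; it is one possible way to also reject N segments that contain consecutive G's.

```matlab
% Hypothetical example: N = 'ACTGCA' (length 6) in the form N GGG N GGG N
seq = 'ACTGCAGGGACTGCAGGGACTGCA';

% Length of N, assuming all three N segments have equal length
nLen = (length(seq) - 6) / 3;

% Start indices where a GGG run is allowed to begin
validIdx = [nLen + 1, 2*nLen + 3 + 1];

% strfind returns the start index of every occurrence, including overlaps
idx = strfind(seq, 'GGG');

% The sequence passes only if GGG starts at exactly the allowed indices
isValid = isequal(idx, validIdx);

% Optional stricter check (an assumption, not from the formula above):
% every GG must lie inside one of the two allowed GGG runs
ggIdx = strfind(seq, 'GG');
noExtraGG = isequal(ggIdx, [nLen + 1, nLen + 2, 2*nLen + 4, 2*nLen + 5]);
```

Because "strfind" reports overlapping matches, a G directly before or after an allowed run shows up as an extra index, so the comparison with validIdx also rejects runs that are extended by N.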
Also, you may need to loop through your data for this, or, if all of your data is stored in a cell array, you can take the shorter route of using "cellfun" http://www.mathworks.com/help/matlab/ref/cellfun.html
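For data held in a cell array, the "cellfun" route might look something like this. The two sequences and the names seqs, isValidSeq, valid, and validSeqs are hypothetical, and the check again assumes three N segments of equal length.

```matlab
% Hypothetical data: one sequence per cell
seqs = {'ACTGCAGGGACTGCAGGGACTGCA', ...
        'ACTGCAGGGAGGGCAGGGACTGCA'};

% True when GGG starts only at nLen+1 and 2*nLen+4, where
% nLen = (length - 6)/3 assumes three N segments of equal length
isValidSeq = @(s) isequal(strfind(s, 'GGG'), ...
    [(length(s) - 6)/3 + 1, 2*(length(s) - 6)/3 + 4]);

% cellfun applies the check to every cell and returns a logical array
valid = cellfun(isValidSeq, seqs);

% Keep only the sequences that pass the check
validSeqs = seqs(valid);
```

The second sequence has "AGGGCA" as its middle N, so "strfind" finds an extra "GGG" at index 11 and the sequence is dropped.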
Have fun!
