Problem 61158. Gene Duplication with Sequencing Errors
You are investigating the genome of the bacterium Codex matlabius. A virus that infects C. matlabius is known to insert long, repeated sections of its own genes into the bacterial genome. Your job is to find duplicates in the genome that might signal these viral insertions.
Unfortunately, your gene sequencer isn't perfect and sometimes makes reading mistakes. You need to consider both exact matches and very close matches with no more than 1 mismatch (disagreement between the two sequences).
Given a single string of nucleotide characters taken from the genome, find the longest substring that appears in two non-overlapping locations. The two occurrences can either match exactly or differ by at most one character.
Rules:
- The two occurrences must not overlap
- They must be at least 5 nucleotides in length
- Only characters A (adenine), C (cytosine), G (guanine), or T (thymine) appear in the input
- If the two occurrences differ by exactly one character, mark that position with 'X' in the output
- The 'X' marker must appear in the interior of the string, never at the beginning or end
- If the two occurrences match exactly, return the substring without any 'X'
- If no valid duplicated substring exists, return an empty string
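The rules above can be sketched as a brute-force search. This is a minimal Python illustration, not the reference solution and not a MATLAB submission; the name `find_dupe` and the `min_len` parameter are hypothetical. It also assumes that ties in length go to the candidate whose first occurrence starts earliest, which the problem statement does not confirm.

```python
def find_dupe(genome: str, min_len: int = 5) -> str:
    """Longest substring that occurs twice without overlap, allowing one interior mismatch."""
    n = len(genome)
    best = ""
    for s in range(n):                       # start of the first occurrence
        for d in range(min_len, n - s):      # offset to the start of the second occurrence
            if genome[s] != genome[s + d]:
                continue                     # a mismatch at the first position is not allowed
            mismatch_at = -1                 # index within the window of the single mismatch
            end = 0                          # last index at which the window may end
            # The length is capped at d so the two occurrences never overlap.
            for k in range(1, min(d, n - s - d)):
                if genome[s + k] != genome[s + d + k]:
                    if mismatch_at >= 0:
                        break                # second mismatch: stop extending
                    mismatch_at = k
                else:
                    end = k                  # the window must end on a matching character
            length = end + 1
            # Assumption: strict '>' keeps the earliest candidate when lengths tie.
            if length >= min_len and length > len(best):
                chars = list(genome[s:s + length])
                if 0 <= mismatch_at < end:
                    chars[mismatch_at] = "X"  # mark the interior mismatch
                best = "".join(chars)
    return best


print(find_dupe("AATGCTACCTTAGTACCACTGGATGCTACATTAGA"))  # ATGCTACXTTAG
print(find_dupe("AAATCGATCGTTTCGATCG"))                  # TCGATCG
```

The search is roughly cubic in the genome length in the worst case, which is acceptable for short test strings but would need a smarter approach for long genomes.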
Example 1: Fuzzy match (1 mismatch)
Input
genome = 'AATGCTACCTTAGTACCACTGGATGCTACATTAGA'
Output
dupe = 'ATGCTACXTTAG'
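The duplicated gene (with one mismatch at position 8 of the substring) appears in two places: at genome positions 2-13 as 'ATGCTACCTTAG' and at positions 23-34 as 'ATGCTACATTAG' (1-based indexing).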
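The two locations can be confirmed with a short check. The sketch below is a Python illustration; the helper `occurrences` is a hypothetical name, and it treats 'X' in the output as a wildcard that matches any nucleotide.

```python
genome = "AATGCTACCTTAGTACCACTGGATGCTACATTAGA"
dupe = "ATGCTACXTTAG"


def occurrences(genome: str, pattern: str) -> list:
    """Return 0-based start indices where pattern matches, treating 'X' as a wildcard."""
    hits = []
    for i in range(len(genome) - len(pattern) + 1):
        window = genome[i:i + len(pattern)]
        if all(p == "X" or p == c for p, c in zip(pattern, window)):
            hits.append(i)
    return hits


for start in occurrences(genome, dupe):
    # Report 1-based positions to match MATLAB indexing.
    print(start + 1, genome[start:start + len(dupe)])
# 2 ATGCTACCTTAG
# 23 ATGCTACATTAG
```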
Example 2: Exact match (X at beginning is not allowed)
Input
genome = 'AAATCGATCGTTTCGATCG'
Output
dupe = 'TCGATCG'
While there's a potential 8-character fuzzy match ('ATCGATCG' vs. 'TTCGATCG'), it would require 'X' at the beginning, which is not allowed. The expected output is therefore the 7-character exact match.
Problem Comments
3 Comments
Matthew Bolyard
23 hours and 53 minutes ago
I believe some test cases have multiple solutions. If say there are 2 duplicated genes of length 8, the test cases go with the first matching gene.
Ned Gulley
6 hours and 30 minutes ago
Thanks for the note Matthew. I'll take a look at that.
Ned Gulley
5 hours and 32 minutes ago
I updated the test suite. Hopefully this resolves the ambiguity.