Problem 939. DNA Pattern Match: Performance Metric - Speed

The Challenge is to Rapidly find matches of DNA sequences, Length=6, in a 1,800,000 long DNA file.

At IMACST the paper An Intelligent and Efficient Matching Algorithm to Finding a DNA Pattern claimed an astounding time improvement from 9.94 seconds to 7.84 seconds, 21% time reduction, to match six segments of length 6 in a 1.8M long DNA file. Basic probability asserts 1.8M/4^6 * 6 = 2637 matches. The paper's test case produced 2346 matches. The method employed used text processing in C++. The paper's L=25 and L=50 cases will be later challenges.

Matlab can achieve matching a six pattern set of L=6 in <15 msec (i5/16GB). This is merely a 99.8% time reduction.

Challenge Description: DNA is made of letters ACGT, wiki DNA, which for the purposes of this Matlab Cody Challenge are given values 0 thru 3. (ACGT= 0123)

Input: [DNA, DNA_ID, Patterns]

  • DNA is a 1.8M long uint8 row vector of values 0 thru 3.
  • DNA_ID identifies the DNA segment being processed. Multiple calls using the same DNA_ID will be performed. The first call of a DNA_ID is not timed.
  • Patterns is an Nx6 uint8 array where each row corresponds to a search pattern.

Output: Locations

Locations of all start indices that match any of the patterns

Scoring: Average Time (msec) for a block of L=6 patterns

Example:

  • DNA= [0 1 2 3 3 2 1 0 1 2 3 3]
  • Pattern = [2 3 3 2 1 0]
  • Locations = [3]

Hints:

  • Vectorization (Base 4 to Base 10 of 6 character words)
  • Bitshift/Reshape to create all words
  • Logical Indexing

Coming soon: Genome DNA sequencing of PhagePhix174 and Haempphilus Influenza

Solution Stats

42.31% Correct | 57.69% Incorrect
Last Solution submitted on Nov 30, 2023

Problem Comments

Solution Comments

Show comments

Problem Recent Solvers8

Suggested Problems

More from this Author294

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!