Problem 811. Genome Sequence 004: Long 3rd Generation Segment Correction

The genome of the budgerigar (Melopsittacus undulatus), a parrot, was successfully sequenced in July 2012 using long 3rd-generation reads provided by PacBio. The Assemblathon genome contest led the team of Phillippy, Koren, and Jarvis to sequence the parrot DNA by combining PacBio 3rd-generation data with Illumina 2nd-generation data.

The 3rd-generation PacBio reads are very long, 1K-20K bases, but have a 15% error rate. The Illumina reads are 100-500 bases long with a <1% error rate. Jarvis and his team combined these data to achieve a <0.1% error rate.

Genome Challenge 004 is the correction of simplified, simulated PacBio reads with a high error rate.

Input:

Call 1: empty array, segment width, flag=0

Call 2: N PacBio DNA vectors (N x width), segment width, flag=1

Output:

Call 1: empty vector, number of requested vectors

Call 2: corrected DNA vector, number of requested vectors

Score: the number of vectors, N, used to produce the correct vector for the w=1024 case

The first call to the PacBio_fix routine returns the number of vectors it requests in order to produce the final result. This number may be a function of the width w.
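One way to reason about how the requested count could depend on w is a rough probability bound. The sketch below is in Python rather than MATLAB, and every name in it is hypothetical. It assumes that the 15% PacBio error rate quoted earlier also applies to the simulated reads (the problem does not say so), that errors are independent across reads, and that a column counts as failed unless the true symbol holds a strict majority. The 1% overall failure target is likewise an arbitrary illustration.

```python
from math import comb


def column_error_bound(n_reads, error_rate=0.15):
    """Upper bound on the chance that one column is voted wrong.

    Conservative: the vote is counted as failed unless the true symbol
    holds a strict majority of the n_reads entries in that column.
    """
    max_correct_for_failure = n_reads // 2
    p_ok = 1.0 - error_rate
    return sum(
        comb(n_reads, k) * p_ok**k * error_rate ** (n_reads - k)
        for k in range(max_correct_for_failure + 1)
    )


def reads_needed(width, error_rate=0.15, target=0.01):
    """Smallest odd N whose union bound over all columns is <= target."""
    if not 0.0 <= error_rate < 0.5:
        raise ValueError("error_rate must be in [0, 0.5) for voting to converge")
    n = 1
    # Union bound: P(any column wrong) <= width * P(one column wrong).
    while width * column_error_bound(n, error_rate) > target:
        n += 2
    return n
```

Because the bound is pessimistic on two counts (strict majority and the union bound), the count it suggests is likely larger than what a real solution needs, so it is better read as a starting point than as a score to aim for.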

The second call to PacBio_fix will have a DNA matrix (N x width) and flag=1.

The response to the second call is the fixed DNA sequence, a vector of width w.
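The two-call convention described above can be sketched as follows. This is a Python illustration, not a MATLAB Cody submission: the function name `pacbio_fix` mirrors PacBio_fix, while the constant `REQUESTED_READS`, the return layout, and the small driver are assumptions made for the example. The correction step is a plain column-wise vote, which is only one possible approach.

```python
from collections import Counter

REQUESTED_READS = 3  # hypothetical choice; the problem leaves N to the solver


def pacbio_fix(reads, width, flag):
    """Return (vector, n_requested) following the two-call convention."""
    if flag == 0:
        # Call 1: no data yet, only report how many reads are wanted.
        return [], REQUESTED_READS
    # Call 2: reads is an N x width matrix of pre-aligned symbols.
    if any(len(row) != width for row in reads):
        raise ValueError("every read must have exactly `width` symbols")
    # Take the most frequent symbol in each column.
    corrected = [Counter(col).most_common(1)[0][0] for col in zip(*reads)]
    return corrected, REQUESTED_READS


# Driver that mimics the two calls, using made-up reads of width 4.
width = 4
_, n = pacbio_fix([], width, 0)
reads = [[0, 1, 2, 3], [0, 1, 1, 3], [3, 1, 2, 3]][:n]
fixed, _ = pacbio_fix(reads, width, 1)
```

Keeping the requested count in one place makes it easy to turn it into a function of `width` later, since the problem allows that.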

Example: the first call returns N=3

01230123111122223333 Truth
Input example:
01232123112122221332 Injected errors
01130123111122123323
11230133121122223333
Output:
01230123111122223333 Truth, hopefully

This data is simplified: the only errors are simple substitutions, and the data sets are provided pre-aligned.
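Because the reads are pre-aligned and contain substitutions only, a column-wise plurality vote is a natural first attempt. The Python sketch below (the helper name `majority_vote` is hypothetical, and this is not presented as the author's reference solution) applies it to the three example reads and also reports which columns had any disagreement.

```python
from collections import Counter


def majority_vote(reads):
    """Column-wise plurality vote over equal-length, pre-aligned reads.

    Returns the consensus string and the 0-based indices of the
    columns in which the reads disagreed.
    """
    if len({len(r) for r in reads}) > 1:
        raise ValueError("reads must all have the same length")
    consensus = []
    disputed = []
    for i, column in enumerate(zip(*reads)):
        counts = Counter(column)
        # most_common orders ties by first appearance, so a tie falls
        # back to the symbol from the earliest read.
        symbol, _ = counts.most_common(1)[0]
        consensus.append(symbol)
        if len(counts) > 1:
            disputed.append(i)
    return "".join(consensus), disputed


reads = [
    "01232123112122221332",
    "01130123111122123323",
    "11230133121122223333",
]
consensus, disputed = majority_vote(reads)
print(consensus)  # 01230123111122223333
```

The vote recovers the truth here because no column has errors in more than one of the three reads. With a high error rate and long segments that will not always hold, which is why the number of requested reads matters for the score.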

The real PacBio data is quite a bit more complicated. Bases may be inserted, deleted, or substituted, and the reads are of varying lengths. This causes alignment issues.

Follow-Up Challenges: Sample data from the PacBio site for Lambda Phage will be molded into various challenges. Possible challenges are correcting individual long segments and assembling multiple long segments into the full Lambda Phage genome. The parrot genome is too big for Cody to solve in 50 seconds.
