srafasterqdump

Download FASTQ or FASTA files from SRA

Since R2024a

Syntax

outputFileNames = srafasterqdump(accessionNumbers)

outputFileNames = srafasterqdump(accessionNumbers,SRAFasterqDumpOptions)

outputFileNames = srafasterqdump(accessionNumbers,Name=Value)

Description

outputFileNames = srafasterqdump(accessionNumbers) downloads the corresponding files from SRA (Sequence Read Archive) [1] for the specified accession numbers and returns the names of the downloaded files.

srafasterqdump requires the SRA Toolkit for Bioinformatics Toolbox™. If this support package is not installed, then the function provides a download link. For details, see Bioinformatics Toolbox Software Support Packages.

outputFileNames = srafasterqdump(accessionNumbers,SRAFasterqDumpOptions) uses additional options specified by SRAFasterqDumpOptions.

outputFileNames = srafasterqdump(accessionNumbers,Name=Value) specifies additional options using one or more name-value arguments. For example, you can retrieve the FASTA-formatted file by using the FastaOutput name-value argument.

Examples

Download NGS Data from SRA

Download paired-end sequencing data in FASTQ format using the accession run number SRR11846824, which has two reads per spot and no unmated reads. Downloading the data may take a few minutes.

tbl = srafasterqdump("SRR11846824")

tbl=1×2 table
           Reads_1                  Reads_2       
    "SRR11846824_1.fastq"    "SRR11846824_2.fastq"

By default, the function uses the SplitType="SplitThree" option and downloads only biological reads. Specifically, the function splits spots into reads. For spots having two reads, the function produces *_1.fastq and *_2.fastq, represented by the Reads_1 and Reads_2 columns. If there are any unmated reads, the function saves them in a *.fastq file, represented by the Reads column. Because this accession has no unmated reads, the function did not produce a *.fastq file, and the output table has no Reads column. For details, see SplitType.

You can also specify other download options using SRAFasterqDumpOptions. For instance, use FastaOutput=true to get the FASTA-formatted file.

sraopt = SRAFasterqDumpOptions;
sraopt.FastaOutput = true;
tbl2 = srafasterqdump("SRR11846824",sraopt);

Alternatively, you can specify the options as name-value arguments instead of using the options object.

tbl2 = srafasterqdump("SRR11846824",FastaOutput=true);
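Several name-value arguments can be combined in one call. The following is a minimal sketch, assuming the support package is installed and a network connection is available. The folder name sra_downloads and the specific values for NumThreads and MinReadLength are illustrative assumptions, not recommendations.

```matlab
% Illustrative output folder (hypothetical name, not part of the toolbox)
outDir = fullfile(tempdir,"sra_downloads");
if ~isfolder(outDir)
    mkdir(outDir);
end

% Download FASTQ files into outDir using 4 threads and
% exclude reads shorter than 20 bases.
tbl3 = srafasterqdump("SRR11846824", ...
    OutputDirectory=outDir, ...
    NumThreads=4, ...
    MinReadLength=20);
```

The same settings could instead be assigned as properties of an SRAFasterqDumpOptions object, which is convenient when reusing one configuration for several accessions.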

You can also download the data in a SAM format using srasamdump.

samFile = srasamdump("SRR11846824")

samFile = 
"SRR11846824.sam"

Specify the download options using an SRASAMDumpOptions object. For instance, specify the output file name and compress the output file using bzip2.

samdumpopt = SRASAMDumpOptions;
samdumpopt.OutputFileName = "SRR11846824.sam.bz2";
samdumpopt.BZip2 = true

samdumpopt = 
  SRASAMDumpOptions with properties:

   Default properties:
       ExtraCommand: ""
        FastaOutput: 0
        FastqOutput: 0
               GZip: 0
      HideIdentical: 0
         IncludeAll: 0
      MinMapQuality: 0
      OutputPrimary: 0
    OutputUnaligned: 0
            Version: "3.0.6"

   Modified properties:
     OutputFileName: "SRR11846824.sam.bz2"
              BZip2: 1

bzFile = srasamdump("SRR11846824",samdumpopt)

bzFile = 
"SRR11846824.sam.bz2"

After downloading the data, you can use it for downstream analyses. For instance, you can use bowtie2 to map the reads in the downloaded FASTQ files to a reference sequence.

First, download the C. elegans reference sequence.

celegans_refseq = fastaread("https://s3.amazonaws.com/igv.broadinstitute.org/genomes/seq/ce11/ce11.fa");

Save Chromosome 3 reference data in a FASTA file.

celegans_chr3   = celegans_refseq(3).Sequence;
warnState = warning;
warning('off','Bioinfo:fastawrite:AppendToFile'); 
fastawrite("celegans_chr3.fa",celegans_chr3);
warning(warnState);

Build a set of index files using bowtie2build. The status value of 0 means that the build was successful.

status = bowtie2build("celegans_chr3.fa","celegans_chr3_index");

Align read data to the reference. This may take a few minutes.

bowtie2("celegans_chr3_index","SRR11846824_1.fastq","SRR11846824_2.fastq","SRR11846824_mapped.sam");

Create a quality control plot for the SAM file. Note that, for this particular experiment, most of the reads happen to have the same quality score of 30.

seqqcplot("SRR11846824_mapped.sam");

Convert the SAM file to a BAM file. Suppress two informational warnings that are issued while creating a BioMap object.

w = warning;
warning("off","bioinfo:BioMap:BioMap:UnsortedReadsInSAMFile");
warning("off","bioinfo:saminfo:InvalidTagField");
bmObj = BioMap("SRR11846824_mapped.sam");
write(bmObj,"SRR11846824_mapped.bam",Format="BAM");
warning(w);

Visualize the alignment data in the Genomics Viewer app. The corresponding cytoband file is provided with the toolbox.

gv = genomicsViewer(ReferenceFile="celegans_chr3.fa",CytoBand="celegans_cytoBandIdeo.txt.gz");
addTracks(gv,"SRR11846824_mapped.bam");

Use the zoom slider to zoom in and see the features. Or you can enter the following in the search text box: Generated:3,711,861-3,711,940.

You may delete the downloaded files, such as the reference sequence file.

delete celegans_chr3.fa

Close the app.

close(gv);

Input Arguments

`accessionNumbers` — Accession run numbers
character vector | string scalar | ...

Accession run numbers, specified as a character vector, string scalar, string vector, or cell array of character vectors.

An accession run number has one of these formats: SRR____, ERR____, or DRR____. A run contains the actual sequencing data for a particular sequencing experiment. An experiment can contain several runs depending on the number of sequencing instrument runs required. For details about the formats of SRA accession numbers, see Understanding SRA Search Results.

For more information on searching for an SRA accession number, see Search in SRA Entrez.

Example: "SRR1553607"

Data Types: char | string | cell
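Before starting a long download, it can help to check that each identifier looks like a run accession. The sketch below is one possible check. It assumes that the part after the SRR, ERR, or DRR prefix consists only of digits, which is an assumption of this example and not something the function documents. The candidate list is made up for illustration.

```matlab
% Hypothetical list of identifiers to screen
candidates = ["SRR1553607","ERR000001","DRR000001","SRX123456","srr42","SRR"];

% Assumed run-accession pattern: SRR, ERR, or DRR followed by digits only
expr = '^(SRR|ERR|DRR)\d+$';

% regexp returns an empty start index for entries that do not match
startIdx = regexp(cellstr(candidates),expr,'once');
isRun = ~cellfun(@isempty,startIdx);

% Keep only the identifiers that look like run accessions
runs = candidates(isRun);
disp(runs)
```

The resulting string vector can be passed directly as accessionNumbers, because the argument accepts a string vector.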

`SRAFasterqDumpOptions` — `srafasterqdump` options
`SRAFasterqDumpOptions` object | character vector | string scalar

srafasterqdump options, specified as an SRAFasterqDumpOptions object, character vector, or string scalar. The character vector or string scalar must be in the original fasterq-dump option syntax (prefixed by one or two dashes).
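As a sketch of the character vector or string form, the call below passes options in fasterq-dump syntax. The flags --threads and --split-files are fasterq-dump command-line flags. Their correspondence to the NumThreads and SplitType="SplitFiles" arguments is an assumption of this example.

```matlab
% Options written in native fasterq-dump syntax
nativeOpts = "--threads 4 --split-files";
tblNative = srafasterqdump("SRR11846824",nativeOpts);

% Presumed equivalent call using name-value arguments
tblNV = srafasterqdump("SRR11846824",NumThreads=4,SplitType="SplitFiles");
```

The name-value form is usually easier to read and validate. The native string form is mainly useful when reusing an existing fasterq-dump command line.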

Name-Value Arguments

Specify optional pairs of arguments as Name1=Value1,...,NameN=ValueN, where Name is the argument name and Value is the corresponding value. Name-value arguments must appear after other arguments, but the order of the pairs does not matter.

Example: tbl = srafasterqdump("SRR000077",FastaOutput=true) downloads the FASTA-formatted file.

`AppendOutputFile` — Flag to append new data to the output file
`false` or 0 (default) | `true` or 1

Flag to append new data to the output file instead of overwriting it, specified as a numeric or logical 1 (true) or 0 (false). By default, the output file is overwritten with new data.

Data Types: double | logical

`ConcatenateReads` — Flag to concatenate sequence information pertaining to each spot
`false` or 0 (default) | `true` or 1

Flag to concatenate sequence information pertaining to each spot, specified as a numeric or logical 1 (true) or 0 (false). By default, the software does not concatenate the information pertaining to each spot. That is, the software writes four lines of FASTQ or two lines of FASTA into one output file for each spot. For details, see FASTQ/FASTA concatenated.

Data Types: double | logical

`ExtraCommand` — Additional commands
`""` (default) | character vector | string scalar

Additional commands, specified as a character vector or string scalar.

The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.

Example: ExtraCommand="--fasta-ref-tbl --internal-ref"

Data Types: char | string

`FastaOutput` — Flag to save output in FASTA format
`false` or 0 (default) | `true` or 1

Flag to save the output in the FASTA format, specified as a numeric or logical 1 (true) or 0 (false). The default output format is the FASTQ format.

Data Types: double | logical

`FastaOutputUnsorted` — Flag to split sequence information without preserving spot order
`false` or 0 (default) | `true` or 1

Flag to split sequence information pertaining to each spot without preserving the spot order, specified as a numeric or logical 1 (true) or 0 (false).

If the value is true, the software splits the sequence information in each spot into reads. For each read, the software writes two lines of FASTA into a single output file. Setting FastaOutputUnsorted=true is the same as setting SplitType="SplitSpot", with the following exceptions:

With FastaOutputUnsorted=true, the original order of the spots and reads is not preserved, and the FastaOutputUnsorted name-value argument applies only to FASTA output.
This setting is faster than the SplitSpot option and does not use temporary files.

Data Types: double | logical

`FilterByBases` — String of bases used to filter output
empty string array (default) | string scalar

String of bases used to filter the output, specified as a string scalar. The output is filtered by comparing it to the specified string of bases and keeping reads that include the specified string of bases.

Data Types: string

`IncludeAll` — Flag to include all object properties
`false` or 0 (default) | `true` or 1

Flag to include all object properties with corresponding default values when converting properties to the original option syntax, specified as a numeric or logical 1 (true) or 0 (false). You can convert properties to the original syntax prefixed by one or two dashes (such as '-e 8 --split-file') by using the getCommand function.

When IncludeAll=false and you call getCommand(optionsObject), the software converts only the specified properties. If the value is true, getCommand converts all available properties, using default values for unspecified properties, to the original syntax.

Note

If you set IncludeAll to true, the software converts all available properties, using default values for unspecified properties. The only exception is when the default value of a property is NaN, Inf, [], '', or "". In this case, the software does not translate the corresponding property.

Data Types: logical | double
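The effect of IncludeAll can be inspected by calling getCommand on an options object before and after setting the flag. This is a minimal sketch. The property values are arbitrary, and the exact text returned by getCommand is not shown here because it depends on the installed version.

```matlab
% Create an options object and modify two properties
opt = SRAFasterqDumpOptions;
opt.NumThreads = 8;
opt.SplitType = "SplitFiles";

% With IncludeAll=false (default), only the modified properties are converted
cmdModified = getCommand(opt)

% With IncludeAll=true, properties with default values are converted as well
opt.IncludeAll = true;
cmdAll = getCommand(opt)
```

Comparing the two results is a quick way to see which native flags a given configuration would produce.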

`IncludeTechnical` — Flag to include technical reads in downloaded files
`false` or 0 (default) | `true` or 1

Flag to include technical reads in the downloaded files, specified as a numeric or logical 1 (true) or 0 (false).

Data Types: double | logical

`MinReadLength` — Minimum length required for read to be included
0 (default) | nonnegative integer

Minimum length required for a read to be included in the output, specified as a nonnegative integer. By default, no read is filtered out.

Data Types: double

`NumThreads` — Number of parallel threads
`6` (default) | positive integer

Number of parallel threads to use, specified as a positive integer. The software runs threads on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but also increases the memory footprint.

Data Types: double

`OutputDirectory` — Folder where output files are saved
empty string array (default) | character vector | string scalar

Folder where the output files are saved, specified as a character vector or string scalar. By default, the software saves the files in the current directory.

Data Types: char | string

`OutputFileName` — Base name of output files
empty string array (default) | character vector | string scalar

Base name of the output files, specified as a character vector or string scalar. The default base name is the accession run number.

Data Types: char | string

`SplitType` — Method used to split sequence information
`"SplitThree"` (default) | `"SplitFiles"` | `"SplitSpot"`

Method used to split sequence information pertaining to each spot, specified as one of the following:

"SplitThree" — The software splits spots into reads. For each read, the software writes four lines of FASTQ or two lines of FASTA. For spots with two reads, the software produces *_1.fastq and *_2.fastq files. The software places unmated reads in *.fastq. If the accession does not have any spot with one single read, the software does not create a *.fastq file. For details, see FASTQ/FASTA split 3.
"SplitSpot" — The software splits spots into reads. For each read, the software writes four lines of FASTQ or two lines of FASTA. All the reads are saved to a single output file. For details, see FASTQ/FASTA split spot.
"SplitFiles" — The software splits spots into reads. For each read, the software writes four lines of FASTQ or two lines of FASTA. The software assigns each read a number n, where 1 ≤ n ≤ 5, and then saves each nth read to the nth file (*_n.fastq). For details, see FASTQ/FASTA split file.

By default, the reads refer to biological reads only. However, if you set IncludeTechnical to true, then the software also includes the technical reads in the output files.

Data Types: char | string
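To compare split methods on the same accession, one option is to give each download its own base name so that files from an earlier call are not overwritten. The sketch below assumes the support package is installed. The base name SRR11846824_spot is a hypothetical choice for this example.

```matlab
% Write all reads of each spot to a single output file
tblSpot = srafasterqdump("SRR11846824", ...
    SplitType="SplitSpot", ...
    OutputFileName="SRR11846824_spot");

% List the columns of the returned table
disp(tblSpot.Properties.VariableNames)
```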

Output Arguments

`outputFileNames` — Names of downloaded files
table

Names of downloaded files, returned as a table. The number of output files depends on the SplitType option and the accession run number. The table can contain one to five columns. The possible column names are Reads, Reads_1, Reads_2, Reads_3, Reads_4, and Reads_5, and at most five of these six names appear for a given SplitType option and accession number.

For example, the Reads column could correspond to the single output file produced when you specify SplitType="SplitSpot". The Reads_n columns, where 1 ≤ n ≤ 5, correspond to the output files produced when you specify SplitType="SplitThree" or SplitType="SplitFiles".
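Because the set of columns varies, code that consumes the output can check the column names instead of assuming them. The following sketch assumes a single accession, so the table has one row, and that each column holds a string file name. The variable isPaired is a name introduced for this example.

```matlab
% Download with default options (SplitType="SplitThree")
tbl = srafasterqdump("SRR11846824");

% Report which file each column refers to
colNames = string(tbl.Properties.VariableNames);
for k = 1:numel(colNames)
    fileName = tbl.(colNames(k));
    fprintf("%s -> %s\n",colNames(k),fileName);
end

% Treat the download as paired-end only if both mate columns are present
isPaired = all(ismember(["Reads_1","Reads_2"],colNames));
```

A check like isPaired can guard downstream steps, such as a paired-end alignment, that need both mate files.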

More About

Biological and Technical Reads

Biological reads are actual sequence data that comes from a biological sample.

Technical reads correspond to technical information, such as adapters, primers, barcodes, and so on. Technical reads are not part of the actual biological sample sequence.

Spot

A spot refers to a location on the flow cell for Illumina® sequencers. All of the bases for a single location constitute a spot, which includes technical reads. For details, see https://www.biostars.org/p/12047/.

References

[1] SRA Toolkit Development Team. https://github.com/ncbi/sra-tools/wiki/01.-Downloading-SRA-Toolkit

Version History

Introduced in R2024a
