cuffdiff
Identify significant changes in transcript expression
Syntax
Description
cuffdiff(
        identifies significant changes in transcript expression between the samples in
          transcriptsAnnot,alignmentFiles)alignmentFiles using the transcript annotation file
          transcriptsAnnot
        [1].
cuffdiff requires the Cufflinks Support Package for the Bioinformatics Toolbox™. If the support package is not installed, then the function provides a download
        link. For details, see Bioinformatics Toolbox Software Support Packages.
cuffdiff(
        uses additional options specified by transcriptsAnnot,alignmentFiles,opt)opt.
cuffdiff(
        uses additional options specified by one or more name-value pair arguments. For example,
          transcriptsAnnot,alignmentFiles,Name,Value)cuffdiff('gyrAB.gtf',["Myco_1_1.sam", "Myco_2_1.sam"],'NumThreads',5)
        specifies five parallel threads.
[
        returns the names of files containing differential expression test results using any of the
        input argument combinations in the previous syntaxes. By default, the function saves all
        files to the current directory.isoformsDiff,geneDiff,tssDiff,cdsExp,splicingDiff,cdsDiff,promotersDiff] = cuffdiff(___)
Examples
Create a CufflinksOptions object to define  cufflinks options, such
            as the number of parallel threads and the output directory to store the results.
cflOpt = CufflinksOptions;
cflOpt.NumThreads = 8;
cflOpt.OutputDirectory = "./cufflinksOut";The SAM files provided for this example contain aligned reads for Mycoplasma
                pneumoniae from two samples with three replicates each. The reads are
            simulated 100bp-reads for two genes (gyrA and
            gyrB) located next to each other on the genome. All the reads are
            sorted by reference position, as required by cufflinks.
sams = ["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam",... "Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"];
Assemble the transcriptome from the aligned reads.
[gtfs,isofpkm,genes,skipped] = cufflinks(sams,cflOpt);
gtfs is a list of GTF files that contain assembled isoforms.
Compare the assembled isoforms using cuffcompare.
stats = cuffcompare(gtfs);
Merge the assembled transcripts using cuffmerge.
mergedGTF = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput');
mergedGTF reports only one transcript. This is because the two
            genes of interest are located next to each other, and cuffmerge
            cannot distinguish two distinct genes. To guide cuffmerge, use a
            reference GTF (gyrAB.gtf) containing information about these two
            genes. If the file is not located in the same directory that you run
                cuffmerge from, you must also specify the file path.
gyrAB = which('gyrAB.gtf'); mergedGTF2 = cuffmerge(gtfs,'OutputDirectory','./cuffMergeOutput2',... 'ReferenceGTF',gyrAB);
Calculate abundances (expression levels) from aligned reads for each sample.
abundances1 = cuffquant(mergedGTF2,["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],... 'OutputDirectory','./cuffquantOutput1'); abundances2 = cuffquant(mergedGTF2,["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"],... 'OutputDirectory','./cuffquantOutput2');
Assess the significance of changes in expression for genes and transcripts between
            conditions by performing the differential testing using cuffdiff.
            The cuffdiff function operates in two distinct steps: the function
            first estimates abundances from aligned reads, and then performs the statistical
            analysis. In some cases (for example, distributing computing load across multiple
            workers), performing the two steps separately is desirable. After performing the first
            step with cuffquant, you can then use the binary CXB output file as
            an input to cuffdiff to perform statistical analysis. Because
                cuffdiff returns several files, specify the output directory is
            recommended.
isoformDiff = cuffdiff(mergedGTF2,[abundances1,abundances2],... 'OutputDirectory','./cuffdiffOutput');
Display a table containing the differential expression test results for the two genes
                gyrB and gyrA.
readtable(isoformDiff,'FileType','text')
ans =
  2×14 table
        test_id            gene_id        gene              locus             sample_1    sample_2    status     value_1       value_2      log2_fold_change_    test_stat    p_value    q_value    significant
    ________________    _____________    ______    _______________________    ________    ________    ______    __________    __________    _________________    _________    _______    _______    ___________
    'TCONS_00000001'    'XLOC_000001'    'gyrB'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     1.0913e+05    4.2228e+05          1.9522           7.8886      5e-05      5e-05        'yes'   
    'TCONS_00000002'    'XLOC_000001'    'gyrA'    'NC_000912.1:2868-7340'      'q1'        'q2'       'OK'     3.5158e+05    1.1546e+05         -1.6064          -7.3811      5e-05      5e-05        'yes'   
You can use cuffnorm to generate normalized expression tables for
            further analyses. cuffnorm results are useful when you have many
            samples and you want to cluster them or plot expression levels for genes that are
            important in your study. Note that you cannot perform differential expression analysis
            using cuffnorm.
Specify a cell array, where each element is a string vector containing file names for a single sample with replicates.
alignmentFiles = {["Myco_1_1.sam","Myco_1_2.sam","Myco_1_3.sam"],...
                  ["Myco_2_1.sam", "Myco_2_2.sam", "Myco_2_3.sam"]}
isoformNorm = cuffnorm(mergedGTF2, alignmentFiles,...
                      'OutputDirectory', './cuffnormOutput');
Display a table containing the normalized expression levels for each transcript.
readtable(isoformNorm,'FileType','text')
ans =
  2×7 table
      tracking_id          q1_0          q1_2          q1_1          q2_1          q2_0          q2_2   
    ________________    __________    __________    __________    __________    __________    __________
    'TCONS_00000001'    1.0913e+05         78628    1.2132e+05    4.3639e+05    4.2228e+05    4.2814e+05
    'TCONS_00000002'    3.5158e+05    3.7458e+05    3.4238e+05    1.0483e+05    1.1546e+05    1.1105e+05
Column names starting with q have the format: conditionX_N, indicating that the column contains values for replicate N of conditionX.
Input Arguments
Name of the transcript annotation file, specified as a string or character vector. The file
            can be a GTF or GFF file produced by cufflinks,
                cuffcompare, or another source of GTF annotations.
Example: "gyrAB.gtf"
Data Types: char | string
Names of SAM, BAM, or CXB files containing alignment records for each sample, specified as a string vector or cell array. If you use a cell array, each element must be a string vector or cell array of character vectors specifying alignment files for every replicate of the same sample.
Example: ["Myco_1_1.sam", "Myco_2_1.sam"]
Data Types: char | string | cell
cuffdiff options, specified as a
              CuffDiffOptions object, string, or character vector. The string or
            character vector must be in the original cuffdiff option syntax
            (prefixed by one or two dashes) [1]. 
Name-Value Arguments
Specify optional pairs of arguments as
      Name1=Value1,...,NameN=ValueN, where Name is
      the argument name and Value is the corresponding value.
      Name-value arguments must appear after other arguments, but the order of the
      pairs does not matter.
    
      Before R2021a, use commas to separate each name and value, and enclose 
      Name in quotes.
    
Example: cuffdiff('gyrAB.gtf',["Myco_1_1.sam",
          "Myco_2_1.sam"],'NumThreads',5,'DispersionMethod',"per-condition")
Sample labels, specified as a string, string vector, character
    vector, or cell array of character vectors. The number of labels must equal the number of
    samples or the value must be empty [].
Example: 
              'ConditionLabels',["Control","Mutant1","Mutant2"]
Data Types: string | char | cell
Contrast file name, specified as a string or character vector.
    The file must be a two-column tab-delimited text file, where each line indicates two conditions
    to compare using cuffdiff. The condition labels in the file must match
    either the labels specified for ConditionLabels or the sample names. The
    file must have a single header line as the first line, followed by one line for each contrast.
    An example of the contrast file format follows. 
| condition_A | condition_B | 
|---|---|
| Control | Mutant1 | 
| Control | Mutant2 | 
If you do not provide this file, cuffdiff compares every
    pair of input conditions, which can impact performance.
Example: 
              'ContrastFile',"contrast.txt"
Data Types: char | string
Method to model the variance in fragment counts across replicates, specified as one of the following options:
- "pooled"— The function uses each replicated condition to build a model and averages these models into a global model for all conditions in the experiment.
- "per-condition"— The function produces a model for each condition. You can use this option only if all conditions have replicates.
- "blind"— The function treats all samples as replicates of a single global distribution and produces one model.
- "poisson"— Variance in fragment counts is a poisson model, where the fragment count is predicted to be the mean across replicates. This method is not recommended.
Select a method depending on whether you expect the variability in each group of samples to be similar.
- When comparing two groups where the first group has low cross-replicate variability and the second group has high variability, choose the - per-conditionmethod.
- If the conditions have similar levels of variability, choose the - pooledmethod.
- If you have only a single replicate in each condition, choose the - blindmethod.
Example: 
              'DispersionMethod',"blind"
Data Types: char | string
Flag to perform isoform switching tests, specified as
        true or false. These tests estimate how much
    differential splicing exists in isoforms from a single primary transcript. By default, the value
    is true and the test  results are saved in the output file
        splicing.diff.
Example: 
              'DoIsoformSwitch',false
Data Types: logical
Flag to normalize fragment counts to fragments per kilobase per million mapped reads (FPKM), specified as true or false. 
Example: 
              'EffectiveLengthCorrection',false
Data Types: logical
The commands must be in the native syntax (prefixed by one or two dashes). Use this option to apply undocumented flags and flags without corresponding MATLAB® properties.
Example: 'ExtraCommand','--library-type
              fr-secondstrand'
Data Types: char | string
False discovery rate to use during statistical
              tests, specified as a scalar between 0 and 1. 
Example: 
              'FalseDiscoveryRate',0.01
Data Types: double
Name of the FASTA file with reference transcripts to detect bias in fragment counts, specified as a string or character vector. Library preparation can introduce sequence-specific bias into RNA-Seq experiments. Providing reference transcripts improves the accuracy of the transcript abundance estimates.
Example: 
              'FragmentBiasCorrection',"bias.fasta"
Data Types: char | string
Expected mean fragment length, specified as a positive integer.
    The default value is 200 base pairs. The function can learn the fragment
    length mean for each SAM file. Using this option is not recommended for paired-end reads.
Example: 'FragmentLengthMean',100
Data Types: double
Expected standard deviation for the fragment length
    distribution, specified as a positive scalar. The default value is 80 base
    pairs. The function can learn the fragment length standard deviation for each SAM file. Using
    this option is not recommended for paired-end reads. 
Example: 'FragmentLengthSD',70
Data Types: double
Flag to create differential analysis files
        (*.diff), specified as true or
        false.
Example: 
              'GenerateAnalysisDiff',false
Data Types: logical
Flag to include all the object properties with the
    corresponding default values when converting to the original options syntax, specified as
        true or false. You can convert the properties to the
    original syntax prefixed by one or two dashes (such as '-d 100 -e 80') by
    using getCommand. The
    default value false means that when you call
        getCommand(optionsObject), it converts only the specified properties.
    If the value is true, getCommand converts all available
    properties, with default values for unspecified properties, to the original syntax. 
Note
If you set IncludeAll to true, the software
            converts all available properties, using  default values for unspecified properties. The
            only exception is when the default value of a property is NaN,
                Inf, [], '', or
                "". In this case, the software does not translate the
            corresponding property.
Example: 'IncludeAll',true
Data Types: logical
Minimum number of replicates to test genes for differential regulation, specified as a positive integer. The function skips the tests when the number of replicates is smaller than the specified value.
Example: 
              'IsoformShiftReplicates',2
Data Types: double
Flag to correct by the transcript length, specified as
        true or false. Set this value to
        false only when the fragment count is independent of the feature size,
    such as for small RNA libraries with no fragmentation and for 3' end sequencing, where all
    fragments have the same length.
Example: 'LengthCorrection',false
            
Data Types: logical
Method to normalize the library size, specified as one of the following options:
- "geometric"— The function scales the FPKM values by the median geometric mean of fragment counts across all libraries as described in [2].
- "classic-fpkm"— The function applies no scaling to the FPKM values or fragment counts.
- "quartile"— The function scales the FPKM values by the ratio of upper quartiles between fragment counts and the average value across all libraries.
Example: 
              'LibraryNormalizationMethod',"classic-fpkm"
Data Types: char | string
Name of the GTF or GFF file containing transcripts to ignore during analysis, specified as a string or character vector. Some examples of transcripts to ignore include annotated rRNA transcripts, mitochondrial transcripts, and other abundant transcripts. Ignoring these transcripts improves the robustness of the abundance estimates.
Example: 'MaskFile',"excludes.gtf"
Data Types: char | string
Maximum number of fragments to include for each locus before
    skipping new fragments, specified as a positive integer. Skipped fragments are marked with the
    status HIDATA in the file skipped.gtf.
Example: 'MaxBundleFrags',400000
Data Types: double
Maximum number of aligned reads to include for each fragment
    before skipping new reads, specified as a positive integer. Inf, the default
    value, sets no limit on the maximum number of aligned reads.
Example: 'MaxFragAlignments',1000
Data Types: double
Maximum number of iterations for the maximum likelihood estimation of abundances, specified as a positive integer.
Example: 'MaxMLEIterations',4000
Data Types: double
Minimum number of alignments required in a locus to perform the significance testing for differences between samples, specified as a positive integer.
Example: 
              'MinAlignmentCount',8
Data Types: double
Minimum abundance of an isoform to include in differential
    expression tests, specified as a scalar between 0 and 1.
    For alternative isoforms quantified at below the specified value, the function rounds down the
    abundance to zero. The specified value is a fraction of the major isoform. The function performs
    this filtering after MLE estimation but before MAP estimation to improve the robustness of
    confidence interval generation and differential expression analysis. Using a parameter value
    other than the default is not recommended.
Example: 'MinIsoformFraction',1e-5
Data Types: double
Flag to improve abundance estimation for reads mapped to
    multiple genomic positions using the rescue method, specified as true or
        false. If the value is false, the function divides
    multimapped reads uniformly to all mapped positions. If the value is true,
    the function uses additional information, including gene abundance estimation, inferred fragment
    length, and fragment bias, to improve transcript abundance estimation.
The rescue method is described in [3].
Example: 'MultiReadCorrection',true
            
Data Types: logical
Flag to use only fragments compatible with a reference
    transcript to calculate FPKM values, specified as true or
        false. 
Example: 'NormalizeCompatibleHits',false
Data Types: logical
Flag to include all fragments to calculate FPKM values,
    specified as true or false. If the value is
        true, the function includes all fragments, including fragments without a
    compatible reference.
Example: 'NormalizeTotalHits',true
Data Types: logical
Number of fragment assignments to perform on each transcript, specified as a positive integer. For each fragment drawn from a transcript, the function performs the specified number of assignments probabilistically to determine the transcript assignment uncertainty and to estimate the variance-covariance matrix for the assigned fragment counts.
Example: 'NumFragAssignmentSamples',40
Data Types: double
Number of draws from the negative binomial random number generator for each transcript, specified as a positive integer. Each draw is a number of fragments that the function probabilistically assigns to transcripts in the transcriptome to determine the assignment uncertainty and to estimate the variance-covariance matrix for assigned fragment counts.
Example: 'NumFragSamples',90
Data Types: double
Number of parallel threads to use, specified as a positive integer. Threads are run on separate processors or cores. Increasing the number of threads generally improves the runtime significantly, but increases the memory footprint.
Example: 'NumThreads',4
Data Types: double
Directory to store analysis results, specified as a string or character vector.
Example: 'OutputDirectory',"./AnalysisResults/"
Data Types: char | string
Seed for the random number generator, specified as a nonnegative integer. Setting a seed value ensures the reproducibility of the analysis results.
Example: 'Seed',10
Data Types: double
Flag to treat input samples as a time series rather than as
    independent experimental conditions, specified as true or
        false. If you set the value to true, you must provide
    samples in order of increasing time: the first SAM file must be for the first time point, the
    second SAM file for the second time point, and so on.
Example: 
              'TimeSeries',true
Data Types: logical
Output Arguments
Name of a file containing transcript-level differential expression results, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/isoform_exp.diff".
Name of a file containing gene-level differential expression results, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/gene_exp.diff".
Name of a file containing primary transcript differential expression results, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/tss_group_exp.diff".
Name of a file containing coding sequence differential expression results, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/cds_exp.diff".
Name of a file containing differential splicing results for isoforms, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/splicing.diff".
Name of a file containing differential coding sequence output, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/cds.diff".
Name of a file containing information on differential promoter use that exists between samples, returned as a string.
The output string also includes the directory information defined by
              OutputDirectory. The default is the current directory. If you set
              OutputDirectory to "/local/tmp/", the output
            becomes "/local/tmp/promoters.diff".
References
[1] Trapnell, Cole, Brian A Williams, Geo Pertea, Ali Mortazavi, Gordon Kwan, Marijke J van Baren, Steven L Salzberg, Barbara J Wold, and Lior Pachter. “Transcript Assembly and Quantification by RNA-Seq Reveals Unannotated Transcripts and Isoform Switching during Cell Differentiation.” Nature Biotechnology 28, no. 5 (May 2010): 511–15.
[2] Anders, Simon, and Wolfgang Huber. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11, no. 10 (October 2010): R106. https://doi.org/10.1186/gb-2010-11-10-r106.
[3] Mortazavi, Ali, Brian A Williams, Kenneth McCue, Lorian Schaeffer, and Barbara Wold. “Mapping and Quantifying Mammalian Transcriptomes by RNA-Seq.” Nature Methods 5, no. 7 (July 2008): 621–28. https://doi.org/10.1038/nmeth.1226.
Version History
Introduced in R2019a
See Also
External Websites
MATLAB Command
You clicked a link that corresponds to this MATLAB command:
Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)