seqpdist
Calculate pairwise distance between sequences
Description
Examples
Read amino acid alignment data into a structure.
seqs = fastaread("pf00002.fa");For every possible pair of sequences in the multiple alignment, ignore sites with gaps and score with the scoring matrix PAM250.
dist = seqpdist(seqs,Method="alignment-score", ... Indels="pairwise-delete",... ScoringMatrix="pam250");
Force the realignment of each sequence pair ignoring the provided multiple alignment.
dist = seqpdist(seqs,Method="alignment-score",... Indels="pairwise-delete", ... ScoringMatrix="pam250", ... PairwiseAlignment=1);
Measure the Jukes-Cantor pairwise distances after realigning each sequence pair, counting the gaps as point mutations.
dist = seqpdist(seqs,Method="jukes-cantor", ... Indels="score", ... ScoringMatrix="pam250", ... PairwiseAlignment=true);
Input Arguments
Nucleotide or amino acid sequences, specified as a cell array of character vectors, vector of strings, matrix of characters, or vector of structures.
You can specify:
- Cell array of character vectors or vector of strings containing nucleotide or amino acid sequences. 
- Matrix of characters, in which each row corresponds to a nucleotide or amino acid sequence. 
- Vector of structures containing a - Sequencefield.
Name-Value Arguments
Specify optional pairs of arguments as
      Name1=Value1,...,NameN=ValueN, where Name is
      the argument name and Value is the corresponding value.
      Name-value arguments must appear after other arguments, but the order of the
      pairs does not matter.
    
Example: D = seqpdist(Seqs,Method="p-distance") calculates the
        pairwise distance using the p-distance method.
      Before R2021a, use commas to separate each name and value, and enclose 
      Name in quotes.
    
Example: D = seqpdist(Seqs,"Method","p-distance")
Method to calculate pairwise distances, specified by a character vector or string scalar in the table.
Methods for Nucleotides and Amino Acids
| Method | Description | 
|---|---|
| p-distance | Proportion of sites at which the two sequences are different.
                           d = p | 
| Jukes-Cantor(default) | Maximum likelihood estimate of the number of substitutions between
                        two sequences.  For nucleotides: For amino acids:  | 
| alignment-score | Distance ( d = (1-score12/score11)* (1-score12/score22) d = 0. | 
Methods with No Scoring of Gaps (Nucleotides Only)
| Method | Description | 
|---|---|
| Tajima-Nei | Maximum likelihood estimate considering the background nucleotide frequencies. The  | 
| Kimura | Considers separately the transitional nucleotide substitution and the transversional nucleotide substitution. | 
| Tamura | Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the GC content. The
                           | 
| Hasegawa | Considers separately the transitional nucleotide substitution, the transversional nucleotide substitution, and the background nucleotide frequencies. The  | 
| Nei-Tamura | Considers separately the transitional nucleotide substitution between purines, the transitional nucleotide substitution between pyrimidines, the transversional nucleotide substitution, and the background nucleotide frequencies. The  | 
Methods with No Scoring of Gaps (Amino Acids Only)
| Method | Description | 
|---|---|
| Poisson | Assumes that the number of amino acid substitutions at each site has a Poisson distribution. | 
| Gamma | Assumes that the number of amino acid substitutions at each site
                        has a Gamma distribution with parameter  By
                        default, the  | 
You can also specify a user-defined distance function using @,
              for example, @distfun. The distance function must have the
              form:
function D = distfun(S1, S2, OptionalArgs)
The distfun function takes the following arguments: 
- S1,- S2— Two sequences of the same length (nucleotide or amino acid).
- OptionalArgs— Optional problem-dependent arguments.
The distfun function returns a scalar that represents the
              distance between S1 and
              S2.
Data Types: char | string | function_handle
How to treat sites with gaps, specified as one of these strings or character vectors.
- score— Scores these sites either as a point mutation or with the alignment parameters, depending on the method selected.
- pairwise-del— For every pairwise comparison, it ignores the sites with gaps.
- complete-del— Ignores all the columns in the multiple alignment that contain a gap. This option is available only if you provided a multiple alignment as the input- Seqs.
Data Types: char | string
Optional arguments for the distance method, specified as a numeric scalar or a four-element numeric vector.
Methods with No Scoring of Gaps (Nucleotides Only)
| Method | Description | 
|---|---|
| Tajima-Nei | Background nucleotide frequencies, specified as a 4-element numeric
                      vector of the form [gA gC gG gT]. | 
| Tamura | Proportion of GC content, specified as a number in the range [0, 1]. | 
| Hasegawa | Background nucleotide frequencies, specified as a 4-element numeric
                      vector of the form [gA gC gG gT]. | 
| Nei-Tamura | Background nucleotide frequencies, specified as a 4-element numeric
                      vector of the form [gA gC gG gT]. | 
Methods with No Scoring of Gaps (Amino Acids Only)
| Method | Description | 
|---|---|
| Gamma | Shape parameter aof the Gamma distribution, specified
                      as a numeric scalar. | 
Controls the global pairwise alignment of input sequences, specified as a numeric
              or logical true (1) or false
                (0). When true, the
                seqpdist function performs global pairwise alignment using the
                nwalign function, while ignoring the
              multiple alignment of the input sequences (if any).
The default value depends on the length of the sequences:
- true— When all input sequences do not have the same length.
- false— When all input sequences have the same length.
Tip
If your input sequences are the same length, then
                    seqpdist assumes they are aligned. If they are not aligned,
                  do one of the following:
- Align the sequences before passing them to - seqpdist, for example, using the- multialignfunction.
- Set - PairwiseAlignmentto- truewhen using- seqpdist.
Use parallel computation to calculate the distance, specified as a numeric or
              logical false (0) or true
                (1).
- If - true, and Parallel Computing Toolbox™ is installed, then computation occurs using- parfor-loops.- If a - parpoolis open, then the computation uses the open- parpooland occurs in parallel.
- If there are no open - parpool, but automatic creation is enabled in the Parallel Preferences, then the default pool will be automatically opened and computation occurs in parallel.
- If there are no open - parpooland automatic creation is disabled, then computation uses- parfor-loops in serial mode.
 
- If Parallel Computing Toolbox is not installed, then computation uses - parfor-loops in serial mode.
- If - false, then the computation uses for-loops in serial mode.
Return the output D as a square matrix, specified as a
              numeric or logical false (0) or
                true (1).
When true, the seqpdist function converts
              the output into a square matrix such that
                  D(I,J)Ith and
                Jth sequences. The square matrix is symmetric and has a
              zero diagonal. Setting Squareform to true is the
              same as using the squareform function in Statistics and Machine Learning Toolbox™.
Type of sequence, specified as "AA" (amino acid) or
                "NT" (nucleotide).
Data Types: char | string
Scoring matrix to use for the global pairwise alignment, specified as a character vector, string scalar, or numeric matrix.
You can specify a scoring matrix name. Valid choices are:
- "BLOSUM50"(default for amino acid sequences)
- "NUC44"(default for nucleotide sequences). This choice is not supported for amino acid sequences.
- "BLOSUM62"
- "BLOSUM30"increasing by- 5up to- "BLOSUM90"
- "BLOSUM100"
- "PAM10"increasing by- 10up to- "PAM500"
- "DAYHOFF"
- "GONNET"
Note
The above scoring matrices, provided with the software, also include a scale
                  factor that converts the units of the output score to bits. You can also specify
                  the Scale name-value argument to specify an additional scale
                  factor to convert the output score from bits to another unit.
You can also specify a numeric matrix, such as the one returned by the blosum, pam, dayhoff, gonnet, or nuc44 function. 
Note
- If you use a scoring matrix that you created or was created by one of these scoring matrix functions, the matrix does not include a scale factor. The output score will be returned in the same units as the scoring matrix. You can use the - Scalename-value argument to specify a scale factor to convert the output score to another unit.
- If you need to compile - seqpdistinto a standalone application or software component using MATLAB® Compiler™, use a matrix instead of a character vector or string for- ScoringMatrix.
You can specify this argument only when the Method argument
              is "alignment-score" or the PairwiseAlignment
              argument is true.
Scale factor applied to the output distance, specified as a positive number. By default, there is no scaling or change in the units of the output distance. If the scoring matrix information also provides a scale factor, then both are used.
Use this argument to control the units of the output distance.
You can specify this argument only when the Method argument
              is "alignment-score" or the PairwiseAlignment
              argument is true.
Penalty for opening a gap in the alignment, specified as a positive integer.
You can specify this argument only when the Method argument
              is "alignment-score" or the PairwiseAlignment
              argument is true.
Penalty for extending a gap in the alignment, specified as a positive integer. The
              default is equal to GapOpen.
You can specify this argument only when the Method argument
              is "alignment-score" or the PairwiseAlignment
              argument is true.
Output Arguments
Biological distance between all pairs of sequences stored in the
              M elements of Seqs, returned as a numeric row
            vector or a numeric matrix.
By default, D is a row vector of length
              1-by-(M*(M-1)/2). The elements are arranged in
            the order ((2,1),(3,1),..., (M,1),(3,2),...(M,2),...(M,M-1)). This is
            the lower-left triangle of the full M-by-M
            distance matrix. To get the distance between the Ith and the
              Jth sequences for I > J, use the
            formula D((J-1)*(M-J/2)+I-J).
When you specify the SquareForm name-value argument as
              true, D is the full
              M-by-M distance matrix.
Data Types: double
Extended Capabilities
To run in parallel, set  'UseParallel' to true.
For more information, see the 'UseParallel' name-value pair argument.
Version History
Introduced before R2006a
See Also
fastaread | dnds | dndsml | multialign | nwalign | phytree | seqlinkage | pdist
MATLAB Command
You clicked a link that corresponds to this MATLAB command:
Run the command by entering it in the MATLAB Command Window. Web browsers do not support MATLAB commands.
Select a Web Site
Choose a web site to get translated content where available and see local events and offers. Based on your location, we recommend that you select: .
You can also select a web site from the following list
How to Get Best Site Performance
Select the China site (in Chinese or English) for best site performance. Other MathWorks country sites are not optimized for visits from your location.
Americas
- América Latina (Español)
- Canada (English)
- United States (English)
Europe
- Belgium (English)
- Denmark (English)
- Deutschland (Deutsch)
- España (Español)
- Finland (English)
- France (Français)
- Ireland (English)
- Italia (Italiano)
- Luxembourg (English)
- Netherlands (English)
- Norway (English)
- Österreich (Deutsch)
- Portugal (English)
- Sweden (English)
- Switzerland
- United Kingdom (English)