Investigating the Bird Flu Virus
This example shows how to calculate Ka/Ks ratios for eight genes in the H5N1 and H2N3 virus genomes, and perform a phylogenetic analysis on the HA gene from H5N1 virus isolated from chickens across Africa and Asia. For the phylogenetic analysis, you will reconstruct a neighbor-joining tree and create a 3-D plot of sequence distances using multidimensional scaling. Finally, you will map the geographic locations where each HA sequence was found on a regional map. Sequences used in this example were selected from the bird flu case study on the Computational Genomics Website [1]. Note: The final section in this example requires the Mapping Toolbox™.
Introduction
There are three types of influenza virus: Type A, B and C. All influenza genomes are comprised of eight segments or genes that code for polymerase B2 (PB2), polymerase B1 (PB1), polymerase A (PA), hemagglutinin (HA), nucleoprotein (NP), neuraminidase (NA), matrix (M1), and non-structural (NS1) proteins. Note: Type C virus has hemagglutinin-esterase (HE), a homolog to HA.
Of the three types of influenza, Type A has the potential to be the most devastating. It affects birds (its natural reservoir), humans and other mammals and has been the major cause of global influenza epidemics. Type B affects only humans causing local epidemics, and Type C does not tend to cause serious illness.
Type A influenzas are further classified into different subtypes according to variations in the amino acid sequences of HA (H1-16) and NA (N1-9) proteins. Both proteins are located on the outside of the virus. HA attaches the virus to the host cell then aids in the process of the virus being fused in to the cell. NA clips the newly created virus from the host cell so it can move on to a healthy new cell. Difference in amino acid composition within a protein and recombination of the various HA and NA proteins contribute to Type A influenzas' ability to jump host species (i.e. bird to humans) and wide range of severity. Many new drugs are being designed to target HA and NA proteins [2,3,4].
In 1997, H5N1 subtype of the avian influenza virus, a Type A influenza virus, made an unexpected jump to humans in Hong Kong causing the deaths of six people. To control the rapidly spreading disease, all poultry in Hong Kong was destroyed. Sequence analysis of the H5N1 virus is shown here [2,4].
Calculate Ka/Ks Ratio For Each H5N1 Gene
An investigation of the Ka/Ks ratios for each gene segment of the H5N1 virus will provide some insight into how each is changing over time. Ka/Ks is the ratio of non-synonymous changes to synonymous in a sequence. For a more detailed explanation of Ka/Ks ratios, see Analyzing Synonymous and Nonsynonymous Substitution Rates. To calculate Ka/Ks, you need a copy of the gene from two time points. You can use H5N1 virus isolated from chickens in Hong Kong in 1997 and 2001. For comparison, you can include H2N3 virus isolated from mallard ducks in Alberta in 1977 and 1985 [1].
For the purpose of this example, sequence data is provided in four MATLAB® structures that were created by genbankread
.
Load H5N1 and H2N3 sequence data.
load('birdflu.mat','chicken1997','chicken2001','mallard1977','mallard1985')
Data in public repositories is frequently curated and updated. You can retrieve the up-to-date datasets by using the getgenbank
function. Note that if data has indeed changed, the results of this example might be slightly different when you use up-to-date datasets.
chicken1997 = arrayfun(@(x)getgenbank(x{:}),{chicken1997.Accession}); chicken2001 = arrayfun(@(x)getgenbank(x{:}),{chicken2001.Accession}); mallard1977 = arrayfun(@(x)getgenbank(x{:}),{mallard1977.Accession}); mallard1985 = arrayfun(@(x)getgenbank(x{:}),{mallard1985.Accession});
You can extract just the coding portion of the nucleotide sequences using the featureparse
function. The featureparse
function returns a structure with fields containing information from the Features section in a GenBank file including with a Sequence field that contains just the coding sequence.
for ii = 1:numel(chicken1997) ntSeq97{ii} = featureparse(chicken1997(ii),'feature','cds','sequence',true); ntSeq01{ii} = featureparse(chicken2001(ii),'feature','cds','sequence',true); ntSeq77{ii} = featureparse(mallard1977(ii),'feature','cds','sequence',true); ntSeq85{ii} = featureparse(mallard1985(ii),'feature','cds','sequence',true); end ntSeq97{1}
ans = struct with fields: Location: '<1..>2273' Indices: [1 2273] UnknownFeatureBoundaries: 1 gene: 'PB2' codon_start: '1' product: 'PB2 protein' protein_id: 'AAF02361.1' db_xref: 'GI:6048850' translation: 'RIKELRDLMSQSRTREILTKTTVDHMAIIKKYTSGRQEKNPALRMKWMMAMKYPITADKRIMEMIPERNEQGQTLWSKTNDAGSDRVMVSPLAVTWWNRNGPTTSTVHYPKVYKTYFEKVERLKHGTFGPVHFRNQVKIRRRVDMNPGHADLSAKEAQDVIMEVVFPNEVGARILTSESQLTITKEKREELKNCNISPLMVAYMLERELVRKTRFLPVAGGTSSVYIEVLHLTQGTCWEQMYTPGGEVRNDDVDQSLIIAARNIVRRATVSADPLASLLEMCHSTQIGGVRMVDILKQNPTEEQAVDICKAAMGLRISSSFSFGGFTFKRTKGFSVKREEEVLTGNLQTLKIQVHEGYEEFTMVGRRATAILRKATRRMIQLIVSGRDEQSIAEAIIVAMVFSQEDCMIKAVRGDLNFVNRANQRLNPMHQLLRHFQKDAKVLFQNWGIEPIDNVMGMIGILPDMTPSTEMSLRGVRVSKMGVDEYSSTERVVVSIDRFLRVRDQQGNVLLSPEEVSETQGMEKLTITYSSSMMWEINGPESVLVNTYQWIIRNWETVKIQWSQEPTMLYNKMEFEPFQSLVPKAARSQYSGFVRTLFQQMRDVLGTFDTVQIIKLLPFAAAPPEQSRMQFSSLTVNVRGSGMRILVRGNSPAFNYNKTTKRLTILGKDAGAITEDPDEGAAGVESAVLRGFLILGKEDKRYGPALSINELSNLTKGEKANVLIGQGDVVLVMKRKRDSSILTDSQTATKRIRMAIN' Sequence: 'agaataaaagaactaagagatttgatgtcgcaatctcgcacacgcgagatactgacaaaaaccactgtggatcatatggccataattaagaagtacacatcaggaagacaggagaagaaccccgctcttagaatgaaatggatgatggcgatgaaatacccgatcacagctgacaaaagaataatggagatgatccctgaaaggaatgagcaaggtcaaactctttggagcaaaacaaatgacgctggatcagacagggtaatggtatcacctctggctgtaacgtggtggaacagaaatggaccaacaacaagtacagtccattatccaaaggtgtataaaacctactttgaaaaggttgaaagattaaaacacggaacctttggccctgttcatttccggaatcaagtcaaaatacgccgcagggttgacatgaaccctggccatgcagatctcagcgctaaagaagcacaagatgtcatcatggaggtcgttttcccaaatgaagttggagccaggatattgacatcagagtcacagctgacaataacaaaggaaaagagggaggaactcaagaattgtaatatttctcctttaatggtggcatatatgttggaaagagaattggttcgcaagaccagattcctaccagtggctggtgggacaagcagcgtatatatagaagtattgcatttgacccaaggaacctgctgggagcagatgtacacaccaggaggggaggtaagaaatgatgatgttgaccaaagtttaatcattgctgctaggaacattgtcaggagagcaacagtatcagcagacccattggcttcactcttggagatgtgccatagcacacaaattggcggagtaagaatggtagacatccttaaacaaaacccaacagaagagcaagctgtagatatatgcaaggcagcaatgggtttaagaatcagctcatccttcagctttggagggttcactttcaaaagaacaaaggggttttctgtcaaaagagaggaagaagtgcttacaggcaacctccaaacattgaagatacaagtacatgaaggatatgaggaattcacaatggttggacgaagagcaacagccattctaagaaaagcaaccagaaggatgatccaactgatagtcagcgggagggacgagcaatcaattgctgaggcaattattgtagcaatggtgttctcacaagaagattgcatgataaaggcagtccgaggtgatttgaatttcgtaaacagagcaaatcaacgactgaaccccatgcaccaactcctgagacacttccaaaaggatgcaaaggtgctgtttcaaaactggggaattgaaccaatcgacaatgtcatggggatgattggaatattgcctgacatgacccccagcacggaaatgtcactaagaggagtgagagttagtaaaatgggggtggatgaatattctagcactgaaagagtggtcgtgagcattgaccgtttcttaagggtccgagatcagcaaggaaatgtactcctatcccctgaagaagttagtgagacacagggaatggaaaagttgacgataacttattcatcgtctatgatgtgggaaattaacggcccagaatcagttctagttaacacataccaatggatcattaggaattgggagactgtaaagatccaatggtcccaagaacccaccatgctatacaataagatggagtttgaaccatttcaatctttagtaccaaaggctgccagaagccaatatagtggatttgtgagaacgctattccagcagatgcgtgatgttttgggaacatttgacactgttcaaataatcaaactactaccatttgcagcagccccacctgaacagagtaggatgcaattttcttctctgactgtgaatgtgaggggatcaggaatgagaatacttgtgagaggtaactcccctgcgtttaactacaacaagacaactaagaggcttacaatacttgggaaggacgcaggtgcaattacagaggacccagatgaaggagcagcaggggtagagtctgcagtattgagagggtttctaattctcggcaaagaagacaaaagatatggaccagcattgagcatcaatgaactgagcaatcttacgaaaggggagaaagctaatgtattgatagggcaaggagacgtggtgttggtaatgaaacggaaacgggactctagcatacttactgacagccagacagcgaccaaaagaattcggatggccatcaatta'
Visual inspection of the sequence structures revealed some of the genes have splice variants represented in the GenBank files. Because this analysis is only on PB2, PB1, PA, HA, NP, NA, M1, and NS1 genes, you need to remove any splice variants.
Remove splice variants from 1997 H5N1
ntSeq97{7}(1) = [];% M2 ntSeq97{8}(1) = [];% NS2
Remove splice variants from 1977 H2N3
ntSeq77{2}(2) = [];% PB1-F2 ntSeq77{7}(1) = [];% M2 ntSeq77{8}(1) = [];% NS2
Remove splice variants from 1985 H2N3
ntSeq85{2}(2) = [];% PB1-F2 ntSeq85{7}(1) = [];% M2 ntSeq85{8}(1) = [];% NS2
You need to align the nucleotide sequences to calculate the Ka/Ks ratio. Align protein sequences for each gene (available in the 'translation' field) using nwalign
function, then insert gaps into nucleotide sequence using seqinsertgaps
. Use the function dnds
to calculate non-synonymous and synonymous substitution rates for each of the eight genes in the virus genomes. If you are interested in seeing the sequence alignments, set the 'verbose' option to true when using dnds
.
Influenza gene names
proteins = {'PB2','PB1','PA','HA','NP','NA','M1','NS1'};
H5N1 Virus
for ii = 1:numel(ntSeq97) [sc,align] = nwalign(ntSeq97{ii}.translation,ntSeq01{ii}.translation,'alpha','aa'); ch97seq = seqinsertgaps(ntSeq97{ii}.Sequence,align(1,:)); ch01seq = seqinsertgaps(ntSeq01{ii}.Sequence,align(3,:)); [dn,ds] = dnds(ch97seq,ch01seq); H5N1.(proteins{ii}) = dn/ds; end
H2N3 Virus
for ii = 1:numel(ntSeq77) [sc,align] = nwalign(ntSeq77{ii}.translation,ntSeq85{ii}.translation,'alpha','aa'); ch77seq = seqinsertgaps(ntSeq77{ii}.Sequence,align(1,:)); ch85seq = seqinsertgaps(ntSeq85{ii}.Sequence,align(3,:)); [dn,ds] = dnds(ch77seq,ch85seq); H2N3.(proteins{ii}) = dn/ds; end H5N1 H2N3
H5N1 = struct with fields: PB2: 0.0226 PB1: 0.0240 PA: 0.0307 HA: 0.0943 NP: 0.0517 NA: 0.1015 M1: 0.0460 NS1: 0.3010 H2N3 = struct with fields: PB2: 0.0048 PB1: 0.0021 PA: 0.0089 HA: 0.0395 NP: 0.0071 NA: 0.0559 M1: 0 NS1: 0.1954
Note: Ka/Ks ratio results may vary from those shown on [1] due to sequence splice variants.
Visualize Ka/Ks ratios in 3-D bar graph.
H5N1rates = cellfun(@(x)(H5N1.(x)),proteins); H2N3rates = cellfun(@(x)(H2N3.(x)),proteins); bar3([H2N3rates' H5N1rates']); ax = gca; ax.XTickLabel = {'H2N3','H5N1'}; ax.YTickLabel = proteins; zlabel('Ka/Ks'); view(-115,16); title('Ka/Ks Ratios for H5N1 (Chicken) and H2N3 (Mallard) Viruses');
NS1, HA and NA have larger non-synonymous to synonymous ratios compared to the other genes in both H5N1 and H2N3. Protein sequence changes to these genes have been attributed to an increase in H5N1 pathogenicity. In particular, changes to the HA gene may provide the virus the ability to transfer into others species beside birds [2,3].
Perform a Phylogenetic Analysis of the HA Protein
The H5N1 virus attaches to cells in the gastrointestinal tract of birds and the respiratory tract of humans. Changes to the HA protein, which helps bind the virus to a healthy cell and facilitates its incorporation into the cell, are what allow the virus to affect different organs in the same and different species. This may provide it the ability to jump from birds to humans [2,3]. You can perform a phylogenetic analysis of the HA protein from H5N1 virus isolated from chickens at different times (years) in different regions of Asia and Africa to investigate their relationship to each other.
Load HA amino acid sequence data from 16 regions/times from the MAT-file provided birdflu.mat
or retrieve the up-to-date sequence data from the NCBI repository using the getgenpept
function.
load('birdflu.mat','HA')
HA = arrayfun(@(x)getgenpept(x{:}),{HA.Accession});
Create a new structure array containing fields corresponding to amino acid sequence (Sequence) and source information (Header). You can extract source information from the HA using featureparse
then parse with regexp
.
for ii = 1:numel(HA) source = featureparse(HA(ii),'feature','source'); strain = regexp(source.strain,'A/[Cc]hicken/(\w+\s*\w*).*/(\d+)','tokens'); proteinHA(ii).Header = sprintf('%s_%s',char(strain{1}(1)),char(strain{1}(2))); proteinHA(ii).Sequence = HA(ii).Sequence; end proteinHA(1)
ans = struct with fields: Header: 'Nigeria_2006' Sequence: 'mekivllfaivslvksdqicigyhannsteqvdtimeknvtvthaqdilekthngklcdldgvkplilrdcsvagwllgnpmcdeflnvpewsyivekinpandlcypgnfndyeelkhllsrinhfekiqiipksswsdheassgvssacpyqgrssffrnvvwlikkdnayptikrsynntnqedllvlwgihhpndaaeqtrlyqnpttyisvgtstlnqrlvpkiatrskvngqsgrmeffwtilkpndainfesngnfiapenaykivkkgdstimkseleygncntkcqtpigainssmpfhnihpltigecpkyvksnrlvlatglrnspqgerrrkkrglfgaiagfieggwqgmvdgwygyhhsneqgsgyaadkestqkaidgvtnkvnsiidkmntqfeavgrefnnlerrienlnkkmedgfldvwtynaellvlmenertldfhdsnvknlydkvrlqlrdnakelgngcfefyhrcdnecmesvrngtydypqyseearlkreeisgvklesigtyqilsiystvasslalaimvaglslwmcsngslqcrici'
Align the HA amino acid sequences using multialign
and visualize the alignment with seqalignviewer
.
alignHA = multialign(proteinHA); seqalignviewer(alignHA);
Calculate the distances between sequences using seqpdist
with the Jukes-Cantor method. Use seqneighjoin
to reconstruct a phylogenetic tree using the neighbor-joining method. Seqneighjoin
returns a phytree
object.
distHA = seqpdist(alignHA,'method','Jukes-Cantor','alpha','aa'); HA_NJtree = seqneighjoin(distHA,'equivar',alignHA);
Use the view
method associated with phytree
objects to open the tree in the Phylogenetic Tree Tool.
view(HA_NJtree);
Visualize Sequence Distances with Multidimensional Scaling (MDS)
Another way to visualize the relationship between sequences is to use multidimensional scaling (MDS) with the distances calculated for the phylogenetic tree. This functionality is provided by the cmdscale
function in Statistics and Machine Learning Toolbox™.
[Y,eigvals] = cmdscale(distHA);
You can use the eigenvalues returned by cmdscale
to help guide your decision of whether to use the first two or three dimensions in your plot.
sigVecs = [1:3;eigvals(1:3)';eigvals(1:3)'/max(abs(eigvals))]; report = ['Dimension Eigenvalues Normalized' ... sprintf('\n %d\t %1.4f %1.4f',sigVecs)]; display(report);
report = 'Dimension Eigenvalues Normalized 1 0.0062 1.0000 2 0.0028 0.4462 3 0.0014 0.2209'
The first two dimensions represent a large portion of the data, but the third still contains information that might help resolve clusters in the sequence data. You can create a three dimensional scatter plot using plot3
function.
locations = {proteinHA(:).Header}; figure plot3(Y(:,1),Y(:,2),Y(:,3),'ok'); text(Y(:,1)+0.002,Y(:,2),Y(:,3)+0.001,locations,'interpreter','no'); title('MDS Plot of HA Sequences'); view(-21,12);
Clusters appear to correspond to groupings in the phylogenetic tree. Find the sequences belonging to each cluster using the subtree
method of phytree
. One of subtree
's required inputs is the node number (number of leaves + number of branches), which will be the new subtree's root node. For your example, the cluster containing Hebei and Hong Kong in the MDS plot is equivalent to the subtree whose root node is Branch 14, which is Node 30 (16 leaves + 14 branches).
cluster1 = get(subtree(HA_NJtree,30),'LeafNames'); cluster2 = get(subtree(HA_NJtree,21),'LeafNames'); cluster3 = get(subtree(HA_NJtree,19),'LeafNames');
Get an index for the sequences belonging to each cluster.
[cl1,cl1_ind] = intersect(locations,cluster1); [cl2,cl2_ind] = intersect(locations,cluster2); [cl3,cl3_ind] = intersect(locations,cluster3); [cl4,cl4_ind] = setdiff(locations,{cl1{:} cl2{:} cl3{:}});
Change the color and marker symbols on the MDS plot to correspond to each cluster.
h = plot3(Y(cl1_ind,1),Y(cl1_ind,2),Y(cl1_ind,3),'^',... Y(cl2_ind,1),Y(cl2_ind,2),Y(cl2_ind,3),'o',... Y(cl3_ind,1),Y(cl3_ind,2),Y(cl3_ind,3),'d',... Y(cl4_ind,1),Y(cl4_ind,2),Y(cl4_ind,3),'v'); numClusters = 4; col = autumn(numClusters); for i = 1:numClusters h(i).MarkerFaceColor = col(i,:); end set(h(:),'MarkerEdgeColor','k'); text(Y(:,1)+0.002,Y(:,2),Y(:,3),locations,'interpreter','no'); title('MDS Plot of HA Sequences'); view(-21,12);
For more detailed information on using Ka/Ks ratios, phylogenetics and MDS for sequence analysis, see Cristianini and Hahn [5].
Optional: Display Geographic Regions of the H5N1 Virus on a Map of Africa and Asia
If you have Mapping toolbox, you can run the following code to generate a figure that shows the geographic locations of the virus.
Create a geostruct structure, regionHA
, that contains the geographic information for each feature, or sequence, to be displayed. A geostruct is required to have Geometry, Lat, and Lon fields that specify the feature type, latitude and longitude. This information is used by mapping functions in Mapping Toolbox to display geospatial data.
[regionHA(1:16).Geometry] = deal('Point'); [regionHA(:).Lat] = deal(9.10, 34.31, 15.31, 39.00, 39.00, 39.00, 55.26,... 15.56, 34.00, 33.14, 34.20, 23.00, 37.35, 44.00,... 22.11, 22.11); [regionHA(:).Lon] = deal(7.10, 69.08, 32.35, 116.00, 116.00, 116.00,... 65.18, 105.48, 114.00, 131.36, 131.40, 113.00,... 127.00, 127.00, 114.14, 114.14);
A geostruct can also have attribute fields that contain additional information about each feature. Add attribute fields Name and Cluster to the regionHA
structure. The Cluster field contains the sequence's cluster number, which you will use to identify the sequences' cluster membership.
[regionHA(:).Name] = deal(proteinHA.Header); [regionHA(cl1_ind).Cluster] = deal(1); [regionHA(cl2_ind).Cluster] = deal(2); [regionHA(cl3_ind).Cluster] = deal(3); [regionHA(cl4_ind).Cluster] = deal(4);
Create a structure using the makesymbolspec
function, which will contain marker and color specifications for each marker to be displayed on the map. You will pass this structure to the geoshow
function. Symbol markers and colors are set to correspond with the clusters in MDS plot.
clusterSymbols = makesymbolspec('Point',... {'Cluster',1,'Marker', '^'},... {'Cluster',2,'Marker', 'o'},... {'Cluster',3,'Marker', 'd'},... {'Cluster',4,'Marker', 'v'},... {'Cluster',[1 4],'MarkerFaceColor',autumn(4)},... {'Default','MarkerSize', 6},... {'Default','MarkerEdgeColor','k'});
Load the mapping information and use the geoshow
function to plot virus locations on a map.
load coastlines load topo figure fig = gcf; worldmap([-45 85],[0 160]) setm(gca,'mapprojection','robinson',... 'plabellocation',30,'mlabelparallel',-45,'mlabellocation',30) geoshow(coastlat,coastlon) geoshow(topo, topolegend, 'DisplayType', 'texturemap') demcmap(topo) brighten(.60) geoshow(regionHA,'SymbolSpec',clusterSymbols); title('Geographic Locations of HA Sequence in Africa and Asia')
close all
References
[1] https://computationalgenomics.blogs.bristol.ac.uk/case_studies/birdflu_demo
[2] Laver, W.G., Bischofberger, N. and Webster, R.G., "Disarming Flu Viruses", Scientific American, 280(1):78-87, 1999.
[3] Suzuki, Y. and Masatoshi, N., "Origin and Evolution of Influenza Virus Hemagglutinin Genes", Molecular Biology and Evolution, 19(4):501-9, 2002.
[4] Gambaryan, A., et al., "Evolution of the receptor binding phenotype of influenza A(H5) viruses", Virology, 344(2):432-8, 2006.
[5] Cristianini, N. and Hahn, M.W., "Introduction to Computational Genomics: A Case Studies Approach", Cambridge University Press, 2007.
[6] Google Earth images were acquired using Google Earth Pro. For more information about Google Earth and Google Earth Pro, visit http://earth.google.com/